Data Scientist: Is This Really Science Or Just Pretension?

James Kobielus

Every true scientist must also be a type of data scientist, although not all self-proclaimed data scientists are in fact true scientists.

True science is nothing without observational data. Without a fine-grained ability to sift, sort, structure, categorize, analyze, and present data, the scientist can’t bring coherence to their inquiry into the factual substrate of reality. Just as critical, a scientist who hasn’t drilled down into the heart of their data can’t effectively present or defend their findings.

Fundamentally, science is a collaborative activity of building and testing interpretive frameworks through controlled observation. At the heart of any science are the “controls” that help you isolate the key explanatory factors from those with little or no impact on the dependent variables of greatest interest. All branches of science rely on logical controls, such as adhering to the core scientific methods of hypothesis, measurement, and verification, as vetted through community controls such as peer review, refereed journals, and the like. Some branches of science, such as chemistry, rely largely on experimental controls. Some, such as astronomy, rely on the controls embedded in powerful instrumentation like space telescopes. Still others, such as the social sciences, may use experimental methods but rely principally on field observation and on statistical methods for finding correlations in complex behavioral data.

Data Scientist: Do You Truly Need Big Data?

James Kobielus

Data science has historically had to content itself with mere samples. Few data scientists have had the luxury of being able to amass petabytes of data on every relevant variable of every entity in the population under study.

The big data revolution is making that constraint a thing of the past. Think of this new paradigm as “whole-population analytics,” rather than simply the ability to pivot, drill, and crunch into larger data sets. Over time, as the world evolves toward massively parallel approaches such as Hadoop, we will be able to do true 360-degree analysis. For example, as more of the world’s population takes to social networking and conducts more of its life in public online forums, we will all have comprehensive, current, and detailed market intelligence on every demographic available as if it were a public resource. As the prices of storage, processing, and bandwidth continue their inexorable decline, data scientists will be able to keep the entire population of all relevant polystructured information under their algorithmic microscopes, rather than have to rely on minimal samples, subsets, or other slivers.

Clearly, the big data revolution is fostering a powerful new type of data science. Having more comprehensive data sets at our disposal will enable more fine-grained long-tail analysis, microsegmentation, next best action, customer experience optimization, and digital marketing applications. It is speeding answers to any business question that requires detailed, interactive, multidimensional statistical analysis; aggregation, correlation, and analysis of historical and current data; modeling and simulation, what-if analysis, and forecasting of alternative future states; and semantic exploration of unstructured data, streaming information, and multimedia.

Data Scientist: What Skills Does It Require?

James Kobielus

Data scientists are a curious breed. The term encompasses a wide range of specialties, all of which rely on statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data.

Who belongs in this category? Clearly, the “quants” are fundamental. Anybody who builds multivariate statistical models, regardless of the tool they use, might call themselves a data scientist. Likewise, data mining specialists who look for hidden patterns in historical data sets — structured, unstructured, or some blend of diverse data types — may certainly use the term. Furthermore, a predictive modeler or any analyst who builds fact-based what-if simulations is a data scientist par excellence. We should also include anybody who specializes in constraint-based optimization, natural language processing, behavioral analytics, operations research, semantic analysis, sentiment analysis, and social network analysis.

But these jobs are only one-half of the data-science equation. The “suits” are also fundamental. Any business domain specialist who works with any of the tools and approaches listed above may consider him- or herself a data scientist. In fact, if one and the same person is a black belt in SAS, SPSS, R, or other statistical tools, and also an expert in marketing, customer service, finance, supply chain, or other business specialties, they are a data scientist par excellence.

Both of these skill sets are fundamental to high-quality data science. Lacking statistical expertise, you can’t understand which algorithms and approaches are the most appropriate foundation for your statistical models. Lacking business domain expertise, you can’t identify the most valid variables and appropriate data sets to build your models around.

Data Scientist: Important New Role Or Trendy Job-Title Inflation?

James Kobielus

The big data universe revolves around this seemingly new role called “data scientist.” For IT professionals who are just now beginning to explore big data, the notion of a data scientist may seem a bit trendy, hence suspect. How does it differ from such familiar jobs as statistical analyst, data miner, predictive modeler, and content analytics specialist?

Yes, data scientist is a trendy new job title to emboss on your business card. But it’s also a very useful new term for referring to a wide range of advanced analytics functions that heretofore have had no consensus category label. The term recognizes that advanced analytics developers, like scientists generally, spend their careers exploring new data for powerful insights that may not be obvious on first glance.

Indeed, one might define a data scientist as someone who uses statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data. This definition is broad enough to encompass a wide range of data scientists doing various types of analyses against many data types. The tools may be usable by any intelligent person, or they may be so specialized and abstruse that you practically need a Ph.D. in higher mathematics to get started. The underlying algorithms may be limited to the most common multivariate regression approaches or may include the latest advances in artificial intelligence and machine learning. The exploration may be highly visual, or it may also involve trial-and-error iteration through complex statistical models.

Hadoop: Future Of Enterprise Data Warehousing? Are You Kidding?

James Kobielus

I kid you not.

What’s clear is that Hadoop has already proven its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and embedded execution of advanced analytics. As noted in a recent blog post, this is in fact the dominant use case for which Hadoop has been deployed in production environments.

Yes, traditional (Hadoop-less) EDWs can in fact address this specific use case reasonably well, from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it’s just a matter of time (one to two years, tops) before all EDW vendors bring Hadoop into the heart of their architectures. For those EDW vendors who haven’t yet fully committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.

Where the next-generation EDW is concerned, the petabyte staging cloud is merely Hadoop’s initial footprint. Enterprises are moving rapidly toward the EDW as the hub for all advanced analytics. Forrester strongly expects vendors to incorporate the core Hadoop technologies — especially MapReduce, Hadoop Distributed File System, Hive, and Pig — into their core architectures. Again, the impressive growth in MapReduce as a lingua franca for predictive modeling, data mining, and content analytics will practically compel EDW vendors to optimize their platforms for MapReduce, alongside high-performance support for SAS, SPSS, R, and other statistical modeling languages and formats. We see clear signs that this is already happening, as with EMC Greenplum’s recent announcement of a Hadoop product family and indications from some of that company’s competitors that they have similar near-term road maps.

Hadoop: What Are These Big Bad Insights That Need All This Nouveau Stuff?

James Kobielus

Problems don’t care how you solve them. The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal.

When people speak of “Big Data,” they’re referring to problems that can best be addressed by amassing massive data sets and using advanced analytics to produce “Eureka!” moments. The issue of what approach — Hadoop cloud, enterprise data warehouse (EDW), or otherwise — gets us to those moments is secondary.

It’s no accident that Big Data mania has also stimulated a vogue in “data scientists.” Many of the core applications of Hadoop are scientific problems in linguistics, medicine, astronomy, genetics, psychology, physics, chemistry, mathematics, and artificial intelligence. In fact, Yahoo’s scientists not only had a predominant role in developing Hadoop but — as exploratory problem-solvers — they are active participants in Yahoo’s efforts to evolve Hadoop into an even more powerful scientific cloud platform.

The problems that are best suited to Hadoop and other Big Data platforms are scientific in nature. What they have in common is a need for analytical platforms and tools that can rapidly scale out to the petabyte level and support the following core features: 

  • Detailed, interactive, multivariate statistical analysis
  • Aggregation, correlation, and analysis of historical and current data
  • Modeling and simulation, what-if analysis, and forecasting of alternate future states
  • Semantic mining of unstructured data, streaming information, and multimedia

Hadoop: What Is It Good For? Absolutely . . . Something

James Kobielus

Enterprises have options. One of the questions I asked the firms I interviewed as Hadoop case studies for my upcoming Forrester report is whether they considered using the tried and true approach of a petabyte-scale enterprise data warehouse (EDW). It’s not a stretch, unless you are a Hadoop bigot and have willfully ignored the commercial platforms that already offer shared-nothing massively parallel processing for in-database advanced analytics and high-performance data management. If you need to brush up, check out my recent Forrester Wave™ for EDW platforms.

Many of the case study companies did in fact consider an EDW like those from Teradata and Oracle. But they chose to build out their Big Data initiatives on Hadoop for many good reasons. Most of those are the same reasons any user adopts any open-source platform: By using Apache Hadoop, they could avoid paying expensive software licenses; give themselves the flexibility to modify source code to meet their evolving needs; and avail themselves of leading-edge innovations coming from the worldwide Hadoop community.

But the basic fact is that Hadoop is not a radically new approach to processing extremely scalable data analytics. You can use a high-end EDW to do most of what you can do with Hadoop with all the core features — including petabyte scale-out, in-database analytics, mixed-workload support, cloud-based deployment, and complex data sources — that characterize most real-world Hadoop deployments. And the open-source Apache Hadoop code base, by its devotees’ own admission, still lacks such critical features as the real-time integration and robust high availability you find in EDWs everywhere.

Hadoop: Is It Soup Yet?

James Kobielus

Most Hadoop-related inquiries from Forrester clients come to me. These have moved well beyond the “What exactly is Hadoop?” phase to the stage where the dominant query is “Which vendors offer robust Hadoop solutions?”

What I tell Forrester clients is that, yes, Hadoop is real, but that it’s still quite immature. On the “real” side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the “immature” side, Hadoop is not ready for broader deployment in enterprise data analytics environments until the following things happen:

  • More enterprise data warehousing (EDW) vendors adopt Hadoop. Of the vendors in my recent Forrester Wave™ for EDW platforms, only IBM and EMC Greenplum have incorporated Hadoop into the core of their solution portfolios. Other leading EDW vendors interface with Hadoop only partially and only at arm’s length. We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still three to five years from fruition. It’s likely that most EDW vendors will embrace Hadoop more fully in the coming year, with strategic acquisitions the likely route.
  • Early implementers converge on a core Hadoop stack. The companies I’ve interviewed as case studies indicate that the only common element in Hadoop deployments is the use of MapReduce as the modeling abstraction layer. We can’t say Hadoop is ready-to-serve soup until we all agree to swirl some common ingredients into the bubbling broth of every deployment. And the industry should clarify the reference framework within which new Hadoop specs are developed.

ERP Grows Into The Cloud: Reflections From SuiteWorld 2011

Holger Kisker

Cloud computing continues to be hyped. By now, almost every ICT hardware, software, and services company has some form of cloud strategy — even if it’s just a cloud label on a traditional hosting offering — to ride this wave. This misleading vendor “cloud washing” and the complex diversity of the cloud market in general make cloud one of the most popular and yet most misunderstood topics today (for a comprehensive taxonomy of the cloud computing market, see this Forrester blog post).

Software-as-a-service (SaaS) is the largest and fastest-growing cloud computing market; its total market size in 2011 is $21.2 billion, and this will explode to $78.4 billion by the end of 2015, according to our recently published sizing of the cloud market. But SaaS consists of many different submarkets: Historically, customer relationship management (CRM), human capital management (HCM, in the form of “lightweight” modules like talent management rather than payroll), eProcurement, and collaboration software have the highest SaaS adoption rates, but highly integrated software applications that process the most sensitive business data, such as enterprise resource planning (ERP), are the lantern-bearers of SaaS adoption today.
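
To put the two SaaS market figures above in perspective, here is a rough back-of-the-envelope sketch of the compound annual growth rate they imply. The function name and the choice of four annual compounding steps (2011 to 2015) are assumptions made for illustration; they are not part of the published sizing.

```python
def implied_cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Market sizes in USD billions, as quoted in the text.
saas_2011 = 21.2
saas_2015 = 78.4

# Assumption: growth compounds once per year from 2011 to 2015.
years = 2015 - 2011

growth = implied_cagr(saas_2011, saas_2015, years)
print(f"Implied annual growth: {growth:.1%}")
```

Under these assumptions the figures work out to a little under 39% growth per year, sustained over four years.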

Join Forrester’s TweetJam On Advanced Analytics: December 15 At 12 pm US Eastern Time

Holger Kisker

Are you interested in business intelligence? Do you wonder about the future of the analytics market or have a question about advanced analytics technologies?

Then join the Forrester analysts Rob Karel, Boris Evelson, Clay Richardson, Gene Leganza, Noel Yuhanna, Leslie Owens, Suresh Vittal, William Frascarelli, David Frankland, Joe Stanhope, Zach Hofer-Shall, Henry Peyret, and me for an interactive TweetJam on Twitter about the state of advanced analytics on Wednesday, December 15th, 2010 from 12:00 p.m. – 1:00 p.m. EST (18:00 – 19:00 CET) using the Twitter hashtag #dmjam. We’ll share the results of our recent research on the analytics market space and discuss how it will change with new technologies entering the scene and maturing over time.

Business intelligence is the fastest-growing software market today as companies are driving business results based on deeper insights and better planning, and advanced analytics is the spearhead of BI technologies that can tap new dimensions of business performance. But what exactly is ‘advanced’ analytics, what technologies are available, and how can they be used efficiently?

Much more detailed information can be found in the blog of Forrester analyst James Kobielus who will lead us through the discussion during the TweetJam. Above you see an overview graphic listing the different elements of advanced analytics today, taken from his blog.

Here are some of the questions we want to debate during our TweetJam discussion:

  • What exactly is and isn’t advanced analytics?
  • What are the chief business applications of advanced analytics?