Data Scientist: What Skills Does It Require?

James Kobielus

Data scientists are a curious breed. The term encompasses a wide range of specialties, all of which rely on statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data.

Who belongs in this category? Clearly, the “quants” are fundamental. Anybody who builds multivariate statistical models, regardless of the tool they use, might call themselves a data scientist. Likewise, data mining specialists who look for hidden patterns in historical data sets — structured, unstructured, or some blend of diverse data types — may certainly use the term. Furthermore, a predictive modeler or any analyst who builds fact-based what-if simulations is a data scientist par excellence. We should also include anybody who specializes in constraint-based optimization, natural language processing, behavioral analytics, operations research, semantic analysis, sentiment analysis, and social network analysis.

But these jobs are only one-half of the data-science equation. The “suits” are also fundamental. Any business domain specialist who works with any of the tools and approaches listed above may consider him- or herself a data scientist. In fact, if one and the same person is a black belt in SAS, SPSS, R, or other statistical tools, and also an expert in marketing, customer service, finance, supply chain, or other business specialties, they are a data scientist par excellence.

Both of these skill sets are fundamental to high-quality data science. Lacking statistical expertise, you can’t judge which algorithms and approaches are the most appropriate foundation for your statistical models. Lacking business domain expertise, you can’t identify the most valid variables and appropriate data sets to build your models around.

Data Scientist: Important New Role Or Trendy Job-Title Inflation?

James Kobielus

The big data universe revolves around this seemingly new role called “data scientist.” For IT professionals who are just now beginning to explore big data, the notion of a data scientist may seem a bit trendy, hence suspect. How does it differ from such familiar jobs as statistical analyst, data miner, predictive modeler, and content analytics specialist?

Yes, data scientist is a trendy new job title to emboss on your business card. But it’s also a very useful new term for referring to a wide range of advanced analytics functions that heretofore have had no consensus category label. The term recognizes that advanced analytics developers, like scientists generally, spend their careers exploring new data for powerful insights that may not be obvious at first glance.

Indeed, one might define a data scientist as someone who uses statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data. This definition is broad enough to encompass a wide range of data scientists doing various types of analyses against many data types. The tools may be usable by any intelligent person, or they may be so specialized and abstruse that you practically need a Ph.D. in higher mathematics to get started. The underlying algorithms may be limited to the most common multivariate regression approaches or may include the latest advances in artificial intelligence and machine learning. The exploration may be highly visual, or it may also involve trial-and-error iteration through complex statistical models.

Hadoop: Future Of Enterprise Data Warehousing? Are You Kidding?

James Kobielus

I kid you not.

What’s clear is that Hadoop has already proven its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and embedded execution of advanced analytics. As noted in a recent blog post, this is in fact the dominant use case for which Hadoop has been deployed in production environments.

Yes, traditional (Hadoop-less) EDWs can in fact address this specific use case reasonably well, at least from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it’s just a matter of time (one to two years, tops) before all EDW vendors bring Hadoop into the heart of their architectures. For those EDW vendors who haven’t yet fully committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.

Where the next-generation EDW is concerned, the petabyte staging cloud is merely Hadoop’s initial footprint. Enterprises are moving rapidly toward the EDW as the hub for all advanced analytics. Forrester strongly expects vendors to incorporate the core Hadoop technologies — especially MapReduce, Hadoop Distributed File System, Hive, and Pig — into their core architectures. Again, the impressive growth in MapReduce as a lingua franca for predictive modeling, data mining, and content analytics will practically compel EDW vendors to optimize their platforms for MapReduce, alongside high-performance support for SAS, SPSS, R, and other statistical modeling languages and formats. We see clear signs that this is already happening, as with EMC Greenplum’s recent announcement of a Hadoop product family and indications from some of that company’s competitors that they have similar near-term road maps.

Predictions And Plans For Business Analytics In 2011

James Kobielus

I love reporters. As someone with an M.A. in journalism who then evolved into an analyst, I recognize that both professions occupy approximately the same tier in the industry food chain. In fact, many IT industry analysts were trade press reporters at one point in their careers, and it’s not uncommon for analysts to go back into media institutions later on.

When great longtime IT reporters, such as Computerworld’s Jaikumar Vijayan, call me up to get my thoughts, I’m just as interested in their take on what’s important. Jai recently published an excellent article with my predictions, plus those of another analyst, on the year ahead in analytics. To the jaded reader, these sorts of year-end look-ahead articles may feel like perfunctory rehashes of stuff we’ve been telling them for quite some time, perhaps with a trendy new buzzword thrown in to keep it remotely glance-worthy.

I try not to repeat myself too much. Rather than regurgitate the statements I made in the phone interview with Jai, I’ll highlight how I’m addressing the principal business-analytics trends that I discussed with him (self-service, pervasive, social, scalable, cloud, and real-time) in our 2011 Forrester research agenda:

Interdictive Analytics: Catching Baddies At The Pass And In The Nick Of Time

James Kobielus

Predictive analytics is not just about forecasting what’s coming down the pike. It’s also about keeping the bad alternative futures from happening. If you can see the nasty things that might happen far enough in advance, you have a better chance of neutralizing or squelching them entirely.

In fact, many real-world applications of predictive analytics are “interdictive,” a term often used in military and law enforcement contexts to refer to tactics that delay, disrupt, or shut down an adversary’s forces or supply routes before they can do damage. Anti-fraud is one of the principal interdictive applications of predictive analytics technology. Companies everywhere rely on data mining to determine who’s been engaging, alone or in groups, in stealing money, supplies, finished goods, cellular airtime, and other valuables — and also where they’re likely to strike next. Likewise, anti-terrorism efforts rely on predictive models to sift through massive collections of historical and real-time intelligence in a Jack Bauer-like race against time and imminent disaster. You best believe that social network analysis is a key weapon in your arsenal for predicting and interdicting these sorts of malignant social patterns.

Findings From Forrester Wave: Customer Service Analytics Empower The Predictive Process

James Kobielus

Bill Band and I have just published the latest Forrester Wave on customer relationship management (CRM) customer service solutions. We included 19 vendors in this in-depth market study and evaluated their solutions against 196 criteria. As you can well imagine, it took time and a fine-toothed analytical comb to compile the research and double-check the facts before we scored these sophisticated product suites. 

One of the key findings from this Forrester Wave is that a growing range of CRM vendors have incorporated deep analytics features into their customer service capabilities. Most provide embedded, out-of-the-box business intelligence (BI) features such as reporting, query, online analytical processing, dashboarding, scorecarding, and key performance indicators prebuilt to support their customer service applications. That’s no surprise, because these core BI features enable enterprises everywhere to keep track of how well they’re providing customer service across diverse CRM interaction channels and to identify opportunities to improve satisfaction, retention, upsell, agent productivity, and other key metrics.

Commercializing Enterprise-Grade Hadoop: Tools For Harnessing Petabyte Analytics

James Kobielus

Hadoop is riding the hype wave right now. You’ll find many IT professionals who know just enough about Hadoop to be dangerous in a cocktail party setting, but not enough to respond comfortably to grilling from the chief technology officer or the geekier business executives.

If you’re slightly bewildered by all the buzz over this new technology with the funny-sounding moniker, you’re not alone. The official story is that Hadoop was the name of the inventor’s kid’s stuffed elephant. However, for most IT professionals, it could easily be an acronym for "Heck, Another Darn Obscure Open-Source Project." The fact that Hadoop, managed by Apache, includes subprojects with similarly opaque names — such as Pig, Hive, Chukwa, and ZooKeeper — contributes to the queasy feeling that this is an untamed menagerie of squealing beasties.

And if you’ve pegged Hadoop as an advanced analytics initiative to mine petabytes of unstructured information, prepare for further bewilderment. The Apache Hadoop project states that it develops open-source software for “reliable, scalable, distributed computing.” Yes, that’s true, but the better-informed among you may be puzzling over the linkages that people often draw between Hadoop, in-database analytics, and MapReduce.

“What BI Is Not” Forrester TweetJam Recap And Takeaways

Rob Karel

On May 13th, Forrester analysts Boris Evelson, Jim Kobielus, Gene Leganza, Holger Kisker and Noel Yuhanna joined me in hosting a data management TweetJam on the topic “What BI is Not!” using the hashtag #dmjam. (You can still see the results and ongoing conversation if you search the hashtag.)

During this one-hour TweetJam, we asked the following questions, leaving 10 minutes of Tweet-time between each question:

  • Do you prefer the broad or the narrow definition of BI? Should ETL, DQ, DW, MDM be considered part of BI?
  • How should we differentiate BI and analytics?
  • What’s the difference between business intelligence and other forms of “intelligence” like competitive intelligence, market intelligence?
  • Is convergence of structured and unstructured information hype or reality?
  • Is BI looking only through the rear-view mirror, or should historical and predictive BI be one and the same?
  • How will social media impact traditional BI?

The response to this event was extraordinary, and we have to thank the large community of data management and BI thought leaders who joined the conversation. During that single hour there were over 360 Tweets, with 65 unique Tweeters actively joining the conversation (not including those who only listened). If you include Tweets leading up to the event and the continued conversation after the event, we’ve seen over 480 Tweets and over 100 Tweeters … and growing.

But what did we accomplish (aside from providing an entertaining distraction for a number of people)? Below, I’ve summarized a sampling of the takeaways that were shared by some of our participants on each question:

1. Do you prefer the broad or the narrow definition of BI? Should ETL, DQ, DW, MDM be considered part of BI?

Number of People Using Advanced Analytics

James Kobielus

Guesstimates are often essential for market sizing and trending. To be useful, especially where primary data are lacking, they demand a valid conceptual framework. 

Like you, I’m looking forward to the responses to Boris Evelson’s quick Web-based survey, which you can access from his most recent blogpost. It’s always a challenge to assess how truly pervasive BI is, and how pervasive it could potentially become.

To generate a valid first approximation, Boris scoped his blog comments and quick survey to “traditional BI” applications (i.e., historical reporting, query, dashboarding). He scoped his estimate only to large enterprise and midmarket firms (i.e., those with 100 or more employees) and only to BI usage in the US.

In order to keep this task manageable, Boris excluded some use cases that are often included in the “traditional BI” category: spreadsheets and other “homegrown” analytics apps; BI embedded in line-of-business apps; and non-interactive, static, published BI outputs. He leveraged both public and Forrester-gathered primary data to gauge how many actual and potential BI users there might be.

Scoping it as he did, Boris estimated that slightly more than 1.5 million people in the US are using traditional BI applications, which is between 2 and 3 percent of the employees of BI-implementing firms. He suspects the actual percentage might be as high as 6 to 8 percent of employees, but he’s not sure. That’s why he’s running the Web-based quick survey.
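
As a rough sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not part of Boris’s analysis: it assumes that “slightly more than 1.5 million” rounds to exactly 1.5 million, and the function and variable names are hypothetical.

```python
# Back-of-envelope check of the BI user estimate quoted above.
# Assumption: "slightly more than 1.5 million" is rounded to 1,500,000.
ESTIMATED_USERS = 1_500_000


def implied_employees(users: int, percent: int) -> int:
    """Employee base implied if `users` is `percent` percent of it."""
    return users * 100 // percent


def users_at(employees: int, percent: int) -> int:
    """Number of users if `percent` percent of `employees` use BI."""
    return employees * percent // 100


# If 1.5 million users are 2 to 3 percent of employees, the base is:
smallest_base = implied_employees(ESTIMATED_USERS, 3)
largest_base = implied_employees(ESTIMATED_USERS, 2)

# If the true share were 6 to 8 percent of that same base (hypothetical):
fewest_users = users_at(smallest_base, 6)
most_users = users_at(largest_base, 8)

print(f"Implied employee base: {smallest_base:,} to {largest_base:,}")
print(f"Users at 6 to 8 percent: {fewest_users:,} to {most_users:,}")
```

Under those assumptions, the 2 to 3 percent figure implies roughly 50 to 75 million employees at BI-implementing firms, and a 6 to 8 percent share would mean roughly 3 to 6 million users, or two to four times the current estimate.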

Self-Service Predictive Modeling: Vendors Still Have Far to Go

James Kobielus

Self-service analytics is one of my core coverage focus areas. It applies not just to business intelligence (BI) but also to advanced analytics. 

When, a few months ago, I uttered the immortal phrase “roll over rocket scientists,” I was referring more specifically to the need for pervasive self-service tools for predictive analytics and data mining (PA/DM). Considering that my recently published Forrester Wave on PA/DM Solutions primarily addressed the traditional requirements of “rocket scientist” experts in statistical analysis, I did not put a huge emphasis on data mining features geared to business analysts, subject matter experts, and other “non-technical” information workers. 

As I’ve stated in that blogpost and the follow-on podcast, the core problem with today’s PA/DM offerings is that many of them are power tools, not solutions that have been designed for the mass business market. Vendors such as SAS Institute, IBM/SPSS, KXEN, Oracle, Portrait Software, Angoss, FICO, and TIBCO Spotfire provide data mining specialists with feature-rich algorithm-powered solutions for modeling, scoring, regression, and other core PA/DM functions. Their core, traditional user base consists of statisticians, mathematicians, and other highly educated analytics professionals. 
