Posted by James Kobielus on November 16, 2011
Data scientists are a curious breed. The term encompasses a wide range of specialties, all of which rely on statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data.
Who belongs in this category? Clearly, the “quants” are fundamental. Anybody who builds multivariate statistical models, regardless of the tool they use, might call themselves a data scientist. Likewise, data mining specialists who look for hidden patterns in historical data sets — structured, unstructured, or some blend of diverse data types — may certainly use the term. Furthermore, a predictive modeler or any analyst who builds fact-based what-if simulations is a data scientist par excellence. We should also include anybody who specializes in constraint-based optimization, natural language processing, behavioral analytics, operations research, semantic analysis, sentiment analysis, and social network analysis.
But these jobs are only one-half of the data-science equation. The “suits” are also fundamental. Any business domain specialist who works with any of the tools and approaches listed above may consider him- or herself a data scientist. In fact, if one and the same person is a black belt in SAS, SPSS, R, or other statistical tools, and also an expert in marketing, customer service, finance, supply chain, or other business specialties, they are a data scientist par excellence.
Both of these skill sets are fundamental to high-quality data science. Lacking statistical expertise, you can’t determine which algorithms and approaches are the most appropriate foundation for your statistical models. Lacking business domain expertise, you can’t identify the most valid variables and the most appropriate data sets to build your models around.
In establishing a data science center of excellence in your organization, you must institute forums, processes, training, tools, and other initiatives that bring people with these diverse skills together to collaborate on common projects. You must also encourage people from each camp to cross-train in the other’s area. Business analysts must learn more sophisticated statistical techniques than their schooling instilled in them and more sophisticated tools than their spreadsheets. Statistical analysts must attach themselves to business groups or functions and learn how to apply their quantitative smarts to real operational problems.
Is the garden-variety spreadsheet jockey a data scientist? Yes: to the extent that they build statistical models and use the spreadsheet to find nonobvious patterns in structured data, they are engaging in a form of data science. But if this exploration is not their primary job function, they are merely dabbling, not specializing.
Is BI report-building or OLAP cube-development data science? No. Those endeavors, although important, revolve around obvious data patterns — obvious in the sense that an organization has chosen to embed them in repeatable views and access patterns.
Data science is all about asking questions. You engage in it whenever you interactively and iteratively search for deep, hidden patterns.