Posted by James Kobielus on November 21, 2011
Every true scientist must also be a type of data scientist, although not all self-proclaimed data scientists are in fact true scientists.
True science is nothing without observational data. Without a fine-grained ability to sift, sort, structure, categorize, analyze, and present data, the scientist can’t bring coherence to their inquiry into the factual substrate of reality. Just as critical, a scientist who hasn’t drilled down into the heart of their data can’t effectively present or defend their findings.
Fundamentally, science is a collaborative activity of building and testing interpretive frameworks through controlled observation. At the heart of any science are the “controls” that help you isolate the key explanatory factors from those with little or no impact on the dependent variables of greatest interest. All branches of science rely on logical controls, such as adhering to the core scientific methods of hypothesis, measurement, and verification, as vetted through community controls such as peer review, refereed journals, and the like. Some branches of science, such as chemistry, rely largely on experimental controls. Some, such as astronomy, rely on the controls embedded in powerful instrumentation like space telescopes. Still others, such as the social sciences, may use experimental methods but rely principally on field observation and on statistical methods for finding correlations in complex behavioral data.
Statistical controls are the bedrock of true science, and they are the core responsibility of the data scientist. Anybody who claims to be a data scientist but has never laid their hands on a multivariate statistical modeling tool or built a statistics-based predictive model, or who has no familiarity with computational linguistics, is not truly a data scientist. They may play other key roles in the data science ecosystem, such as managing Hadoop clusters or writing Hive queries, but they are not scientists, because they are not actively searching for nonobvious patterns in the data sets under their management. Similarly, the computer scientists and mathematicians who develop the algorithms and methods of data science are not data scientists themselves unless they are also searching for patterns in observational data.
Likewise, a BI professional or business analyst cannot legitimately upgrade their job title to “data scientist” unless they also upgrade their skills and make statistics-based interactive data exploration their core function.
Without the tools, skills, and focus of a true data scientist, it’s pretentious and false to suddenly start telling the world that you are one.