Data Scientist: Is This Really Science Or Just Pretension?

Every true scientist must also be a type of data scientist, although not all self-proclaimed data scientists are in fact true scientists.

True science is nothing without observational data. Without a fine-grained ability to sift, sort, structure, categorize, analyze, and present data, the scientist can’t bring coherence to their inquiry into the factual substrate of reality. Just as critical, a scientist who hasn’t drilled down into the heart of their data can’t effectively present or defend their findings.

Fundamentally, science is a collaborative activity of building and testing interpretive frameworks through controlled observation. At the heart of any science are the “controls” that help you isolate the key explanatory factors from those with little or no impact on the dependent variables of greatest interest. All branches of science rely on logical controls, such as adhering to the core scientific methods of hypothesis, measurement, and verification, as vetted through community controls such as peer review, refereed journals, and the like. Some branches of science, such as chemistry, rely largely on experimental controls. Some, such as astronomy, rely on the controls embedded in powerful instrumentation like space telescopes. Still others, such as the social sciences, may use experimental methods but rely principally on field observation and on statistical methods for finding correlations in complex behavioral data.
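To make the idea of a statistical control concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are simulated, and the names z (a hidden common cause), x, and y are hypothetical. By construction x has no effect on y, yet the two correlate because both depend on z. Regressing z out of each variable and correlating what is left (a partial correlation) removes the spurious association.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(target, control):
    """Residuals of target after a least-squares fit on control."""
    n = len(target)
    mean_t, mean_c = sum(target) / n, sum(control) / n
    slope = sum(
        (t - mean_t) * (c - mean_c) for t, c in zip(target, control)
    ) / sum((c - mean_c) ** 2 for c in control)
    return [(t - mean_t) - slope * (c - mean_c) for t, c in zip(target, control)]


# Simulated data (an assumption, not real observations):
# z drives both x and y; x has no direct effect on y.
rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

# Uncontrolled correlation: substantial, though x does not cause y.
naive = pearson(x, y)

# Controlled correlation: remove the part of x and y explained by z.
controlled = pearson(residuals(x, z), residuals(y, z))

print(f"correlation without control:   {naive:.3f}")
print(f"correlation controlling for z: {controlled:.3f}")
```

In this setup the uncontrolled correlation should land near 0.5 and the controlled one near zero. The control works here only because z was measured; a confounder nobody recorded cannot be adjusted away, which is one reason experimental controls matter where they are feasible.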

Statistical controls are the bedrock of true science, and they are the core responsibility of the data scientist. Anybody who claims to be a data scientist but has never laid their hands on a multivariate statistical modeling tool or built a statistics-based predictive model, or who has no familiarity with computational linguistics, is not truly a data scientist. They may play other key roles in the data scientist ecosystem, such as managing Hadoop clusters or writing Hive queries, but are not scientists, in that they are not actively searching for nonobvious patterns in the data sets under their management. Similarly, computer scientists and mathematicians, who develop the algorithms and methods of data science, are not data scientists themselves unless they are also exploring for patterns in observational data.

Likewise, a BI professional or business analyst cannot legitimately upgrade their job title to “data scientist” unless they also upgrade their skills and make statistics-based interactive data exploration their core function.

Without the tools, skills, and focus of a true data scientist, it’s pretentious and false to suddenly start telling the world that you are one.

Comments

interesting

Hey James,

Did you actually do a controlled study of all the people who declare themselves Data Scientists, perhaps on Twitter or LinkedIn, then filter out the ones who don't have stats skills to detect a trend?

Cheers,

- P

Structured survey of Data Scientists?

Sounds like a good idea. No, we haven't done such a study. Perhaps in the coming year, but no firm plans one way or another yet.

Real but rare

I think the data scientist is a real but rare role. But there is a lot about data that we don't understand, so perhaps we ought to have more of them.

Perhaps there is more pretension in executives and operators who presume to act on data they do not understand than in wannabe data scientists who lack the knowledge, skill, ability, and organizational support to actually do science.

Data scientists must be scientists

James, this precisely captures some of our organization's experience. I really love the sentence that all data scientists must be scientists. I can't tell you how often I have been echoing that sentiment. If you succeed in getting the rest of the world to agree to your definition of the term, I will support it.

But as a computer scientist who spends all of my employed hours building tools for data analysts, I bristle at the sharp edges of your definition. I suppose that I have had some experiences where the modeler thinks that the problem is solved when the problem is solved. It is not.

Until the new model has been socialized and productionized, until the organization has adjusted to a new data ecosystem, until the new variables are under data governance, and until deployment and security have been finalized, no one is yet using the model. I personally prefer to extend the data scientist's role to include ongoing responsibility to see that the model is not destroyed by those other team members. That gives rise to a "computer scientist who specializes in data science projects"... and now it's hard for me to object if that person (who happens to be me) would prefer to fit within the definition of a data scientist also, even if my day rarely involves touching the multivariate modeling myself.

Which "sharp edges" of my definition of "data scientist"?

Kevin:

Thanks. Correlation is definitely not the same as causation. A statistical analyst looks for the former, but is not truly a scientist unless they use the scientific method (e.g., "real-world experiments") to zero in on and obtain verification of the latter.

I'm curious which "sharp edges" of my definition of "data scientist" you take issue with. And which "model" are you referring to at the top of the third paragraph of your response? Are you calling for the "data scientist" role to also encompass "data governance" on a master data set (e.g., customer data) within the enterprise data warehouse, or simply within the analytical data set that they used to build, train, and score their model? I'm assuming you mean the latter, but would like to hear your further thoughts and clarifications.

Sharp edges

Well, it's not a major point of contention for me. But I suppose I am thinking a little more broadly in terms of "data science" projects rather than "data scientists" specifically. In my current role, although I myself am a computer scientist, my specialty has become data science projects. For the last seven years, this has comprised 90% of my projects -- Nielsen's core mission is data analytics, after all. So in my mind, just as database administrators and programmers are diverse roles within application development projects, I think that statisticians and ETL experts and other computer scientists fill the roles in data science projects. If the definition were as narrow as those who do the multivariate (or other) modeling, none of the projects would move forward. Within those of my own profession, I often speak of myself as a data science specialist... I suppose that I should now be more careful not to say "data scientist", just to avoid confusion.

So ... in terms of scope, I am perhaps pointing out the obvious. A data scientist works within a team to accomplish a project's goals, and needs to remain engaged as the focus of the project shifts from model development to deployment. I was speaking out of personal irritation with data scientists I have worked with who disengaged too quickly, with the result that their work was twisted beyond recognition by others.

Did you read my blog on adjacent roles?

Kevin:

We're in blinding agreement on the need for the adjacent roles in the data science process. See my blog "Data scientist: Which adjacent roles are central?" (http://blogs.forrester.com/james_kobielus/11-11-17-data_scientist_which_...). My primary focus is on the scientific process, and the role of the data scientist, as one of many, within it.

Jim

adjacent roles and primacy of science

Wholehearted agreement... and no, until you pointed me there, I had missed the November 17 posting, which is excellent and also intriguing. I am very pleased to see that the science aspect is your focus. My current research and writing also focus on how to bring the core science to the foreground in data analytics teams. So many who are not data scientists are working in this field, myself included, and the ramp-up time and business risk to projects are simply enormous if you make the mistake of thinking that analytics consists only of adding a few more tools to existing software development or data warehousing teams. I propose an introduction to the scientific tributaries of analytics for practitioners with different expertise. Not so much a warning of potential pitfalls for non-data-scientists, but more of a roadmap to what scientific skills will be essential to master or to bring into the team for analytics success.

Please check all my "data scientist" blogs

Kevin:

I published 5 consecutive blogs on data scientists. These were two of them. I presented a 360-degree perspective on the topic. I'd love to see your thoughts on them all.

Jim