Posted by James Kobielus on November 17, 2011
Data scientists don’t work in isolation. As with any scientists, they rely on a wide range of people in adjacent roles to help them do their jobs as effectively as possible.
Think about science generally. Over the historical development of modern science, the specialization of roles has continued to proliferate. But today’s professional science establishment is a relatively recent phenomenon. Back in the Middle Ages, and even well into the modern era, scientists often had to be jacks of all trades in order to carry on their investigations. Until the 19th century, there were few professional scientists, research universities, or commercial labs. There were no eager, underpaid graduate students to press into service. Until the 20th century, most professional scientists had to build and maintain their own laboratories, invent and calibrate their own instruments, painstakingly record their own observations, and concoct and promote their own theories.
Today’s professional scientists — of which data scientists are a key category — have it much easier. Whether they work with particle accelerators or linear regression models, scientists know they don’t need to be their own chief cooks and bottle washers. They can make science their day job and rely on a host of others for all of the necessary supporting tools and infrastructure. We find the following broad division of labor in all of today’s scientific disciplines, including data science:
- Investigation. To be a true scientist, your core job must be to investigate reality to whatever depth is necessary and from all relevant angles. Data scientists conduct their investigations with statistical algorithms and interactive exploration tools that help them uncover nonobvious patterns in observational data. In fact, you can regard today’s business-oriented data science as a branch of the behavioral sciences, because most such initiatives focus on investigating the factors that drive human behaviors such as customer churn, purchasing, and recommending.
- Instrumentation. A true scientist uses instrumentation suited to the phenomena that they’re observing, modeling, testing, and measuring. Without statistical modeling, predictive analysis, and other tools, data scientists would not have the pattern-finding instrumentation on which they rely. Likewise, the underlying platform components — including data warehousing, visualization, integration, and governance tools — are key pieces of the instrumentation that data scientists need for exploring deep data. Somebody has to provide all of these tools of the data scientist’s trade, hence the exploding ecosystem of “big data” solution providers.
- Institution. And a true scientist needs to make a steady living focusing on their investigations. The institutions that employ them may be public or private sector, nonprofit or commercial. The institutions that help them communicate and collaborate with other scientists may be professional associations, journals, or other forums. Right now in data science, we see a huge push toward open source models of collaboration. This is most obvious in open source communities focused on platforms and tools, such as Apache Hadoop and R, but it’s the trend in all collective areas of human investigation.
Increasingly, today’s data scientists realize they must stand on the giant shoulders of social networks and other online forums to pool their collective brainpower.