Every true scientist must also be a type of data scientist, although not all self-proclaimed data scientists are in fact true scientists.
True science is nothing without observational data. Without a fine-grained ability to sift, sort, structure, categorize, analyze, and present data, the scientist can’t bring coherence to their inquiry into the factual substrate of reality. Just as critical, a scientist who hasn’t drilled down into the heart of their data can’t effectively present or defend their findings.
Fundamentally, science is a collaborative activity of building and testing interpretive frameworks through controlled observation. At the heart of any science are the “controls” that help you isolate the key explanatory factors from those with little or no impact on the dependent variables of greatest interest. All branches of science rely on logical controls, such as adhering to the core scientific methods of hypothesis, measurement, and verification, as vetted through community controls such as peer review, refereed journals, and the like. Some branches of science, such as chemistry, rely largely on experimental controls. Some, such as astronomy, rely on the controls embedded in powerful instrumentation like space telescopes. Still others, such as the social sciences, may use experimental methods but rely principally on field observation and on statistical methods for finding correlations in complex behavioral data.
Data science has historically had to content itself with mere samples. Few data scientists have had the luxury of being able to amass petabytes of data on every relevant variable of every entity in the population under study.
The big data revolution is making that constraint a thing of the past. Think of this new paradigm as “whole-population analytics,” rather than simply the ability to pivot, drill, and crunch into larger data sets. Over time, as the world evolves toward massively parallel approaches such as Hadoop, we will be able to do true 360-degree analysis. For example, as more of the world’s population takes to social networking and conducts more of its lives in public online forums, we will all have comprehensive, current, and detailed market intelligence on every demographic available as if it were a public resource. As the prices of storage, processing, and bandwidth continue their inexorable decline, data scientists will be able to keep the entire population of all relevant polystructured information under their algorithmic microscopes, rather than having to rely on minimal samples, subsets, or other slivers.
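The shift from sampling to whole-population analytics can be illustrated with a minimal sketch. The population values, sample size, and distribution below are all hypothetical, chosen only to contrast a sample estimate (which carries sampling error) with a statistic computed over every entity:

```python
import random
import statistics

# Hypothetical population: one numeric attribute per entity
# (e.g., annual spend per customer), one million entities.
random.seed(42)
population = [random.lognormvariate(3, 1) for _ in range(1_000_000)]

# Traditional approach: estimate the mean from a small random sample.
sample = random.sample(population, 1_000)
sample_mean = statistics.fmean(sample)

# Whole-population analytics: compute over every entity, no sampling error.
population_mean = statistics.fmean(population)

print(f"sample estimate:  {sample_mean:.2f}")
print(f"population value: {population_mean:.2f}")
```

The sample estimate lands near the true value but not on it; at petabyte scale the same contrast holds, except the "full scan" side requires a parallel framework such as Hadoop rather than a single list in memory.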
Clearly, the big data revolution is fostering a powerful new type of data science. Having more comprehensive data sets at our disposal will enable more fine-grained long-tail analysis, microsegmentation, next best action, customer experience optimization, and digital marketing applications. It is speeding answers to any business question that requires detailed, interactive, multidimensional statistical analysis; aggregation, correlation, and analysis of historical and current data; modeling and simulation, what-if analysis, and forecasting of alternative future states; and semantic exploration of unstructured data, streaming information, and multimedia.
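One of the applications named above, microsegmentation, amounts to grouping a population on several dimensions at once and aggregating within each segment. The records, field names, and spend figures below are hypothetical; the sketch only shows the shape of a two-dimensional group-and-aggregate:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical customer records: (region, age_band, spend).
records = [
    ("east", "18-34", 120.0),
    ("east", "18-34", 80.0),
    ("east", "35-54", 200.0),
    ("west", "18-34", 95.0),
    ("west", "35-54", 310.0),
    ("west", "35-54", 150.0),
]

# Microsegment on two dimensions (region x age band),
# then aggregate spend within each segment.
segments = defaultdict(list)
for region, age_band, spend in records:
    segments[(region, age_band)].append(spend)

for segment, spends in sorted(segments.items()):
    print(segment, round(fmean(spends), 2))
```

With whole-population data, each segment's figure is an exact aggregate rather than an extrapolation from whichever customers happened to fall in the sample.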
Emerging ARM server vendor Calxeda has been hinting for some time that it had a significant partnership announcement in the works, and while we didn’t necessarily disbelieve the company, we hear a lot of claims from startups telling us to “stay tuned” for something big. Sometimes they pan out; sometimes they simply go away. But this morning Calxeda surpassed our expectations by unveiling just one major systems partner – and it happens to be Hewlett Packard, which dominates the worldwide market for x86 servers.
At its core (an unintended but not bad pun), the HP Hyperscale business unit’s Project Moonshot and Calxeda’s server technology are about improving the efficiency of web and cloud workloads, and they promise improvements in excess of 90% in power efficiency and similar gains in physical density compared with current x86 solutions. As I noted in my first post on ARM servers and other documents, even if these estimates turn out to be exaggerated, there is still a generous window within which to do much, much better than current technologies. And because workloads (such as memcache, Hadoop, and static web servers) will be selected for their fit to this new platform, the workloads that actually run on it may come close to the cases quoted by HP and Calxeda.
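To see why there is a "generous window" even under exaggeration, it helps to put rough numbers on the claim. The per-node wattage below is entirely hypothetical; the arithmetic just shows that even halving the claimed 90% improvement still leaves a large gap:

```python
# Hypothetical baseline: watts drawn by one conventional x86 server node.
x86_watts_per_node = 250.0

# "In excess of 90%" power-efficiency improvement, per HP/Calxeda's claim.
claimed_improvement = 0.90
arm_watts_claimed = x86_watts_per_node * (1 - claimed_improvement)

# Even if the claim is exaggerated twofold, the gap remains dramatic.
arm_watts_conservative = x86_watts_per_node * (1 - claimed_improvement / 2)

print(f"ARM node (claimed):      {arm_watts_claimed:.1f} W")
print(f"ARM node (conservative): {arm_watts_conservative:.1f} W")
```

Under these made-up figures, the conservative case still cuts per-node power nearly in half, which is the sense in which the window stays generous even if the headline numbers don't fully hold up.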