Posted by James Kobielus on November 18, 2011
Data science has historically had to content itself with mere samples. Few data scientists have had the luxury of being able to amass petabytes of data on every relevant variable of every entity in the population under study.
The big data revolution is making that constraint a thing of the past. Think of this new paradigm as “whole-population analytics,” rather than simply the ability to pivot, drill, and crunch into larger data sets. Over time, as the world evolves toward massively parallel approaches such as Hadoop, we will be able to do true 360-degree analysis. For example, as more of the world’s population takes to social networking and conducts more of its life in public online forums, we will all have comprehensive, current, and detailed market intelligence on every demographic available as if it were a public resource. As the prices of storage, processing, and bandwidth continue their inexorable decline, data scientists will be able to keep the entire population of all relevant polystructured information under their algorithmic microscopes, rather than having to rely on minimal samples, subsets, or other slivers.
Clearly, the big data revolution is fostering a powerful new type of data science. Having more comprehensive data sets at our disposal will enable more fine-grained long-tail analysis, microsegmentation, next best action, customer experience optimization, and digital marketing applications. It is speeding answers to any business question that requires detailed, interactive, multidimensional statistical analysis; aggregation, correlation, and analysis of historical and current data; modeling and simulation, what-if analysis, and forecasting of alternative future states; and semantic exploration of unstructured data, streaming information, and multimedia.
But let’s not get carried away. Don’t succumb to the temptation to throw more data at every analytic challenge. Quite often, data scientists only need tiny, albeit representative, samples to find the most relevant patterns. Sometimes, a single crucial observation or data point is sufficient to deliver the key insight. And — more often than you may be willing to admit — all you may need is gut feel, instinct, or intuition to crack the code of some intractable problem. New data may be redundant at best, or a distraction at worst, when you’re trying to collect your thoughts.
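To make the point about tiny but representative samples concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the "population" is synthetic (normally distributed values with an invented mean of 50 and spread of 15), and the function name and the sizes are hypothetical. It simply compares the average computed over the whole population with the average from a small simple random sample.

```python
import random
import statistics


def compare_sample_to_population(population_size=200_000, sample_size=1_000, seed=42):
    """Return (population mean, sample mean) for a synthetic population.

    All values are invented for illustration; nothing here comes from real data.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Synthetic "whole population": one numeric measurement per entity.
    population = [rng.gauss(50.0, 15.0) for _ in range(population_size)]
    # Simple random sample without replacement, 0.5% of the population here.
    sample = rng.sample(population, sample_size)
    return statistics.fmean(population), statistics.fmean(sample)


pop_mean, sample_mean = compare_sample_to_population()
print(f"whole-population mean: {pop_mean:.2f}")
print(f"small-sample mean:     {sample_mean:.2f}")
```

Under these assumptions the two averages should land close together, because the sample is drawn at random. The catch is the word "representative": a biased sample of any size would not offer the same comfort.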
Science is, after all, a creative process where practical imagination can make all the difference. As data scientists push deeper into big data territory, they need to keep from drowning in too much useless intelligence. As this dude said recently, keep your big data pile compact and consumable, to facilitate more agile exploration of this never-ending, ever-growing gusher.