Data Scientist: Do You Truly Need Big Data?

Data science has historically had to content itself with mere samples. Few data scientists have had the luxury of being able to amass petabytes of data on every relevant variable of every entity in the population under study.

The big data revolution is making that constraint a thing of the past. Think of this new paradigm as “whole-population analytics,” rather than simply the ability to pivot, drill, and crunch into larger data sets. Over time, as the world evolves toward massively parallel approaches such as Hadoop, we will be able to do true 360-degree analysis. For example, as more of the world’s population takes to social networking and conducts more of its life in public online forums, we will all have comprehensive, current, and detailed market intelligence on every demographic, available as if it were a public resource. As the prices of storage, processing, and bandwidth continue their inexorable decline, data scientists will be able to keep the entire population of all relevant polystructured information under their algorithmic microscopes, rather than have to rely on minimal samples, subsets, or other slivers.

Clearly, the big data revolution is fostering a powerful new type of data science. Having more comprehensive data sets at our disposal will enable more fine-grained long-tail analysis, microsegmentation, next best action, customer experience optimization, and digital marketing applications. It is speeding answers to any business question that requires detailed, interactive, multidimensional statistical analysis; aggregation, correlation, and analysis of historical and current data; modeling and simulation, what-if analysis, and forecasting of alternative future states; and semantic exploration of unstructured data, streaming information, and multimedia.

But let’s not get carried away. Don’t succumb to the temptation to throw more data at every analytic challenge. Quite often, data scientists only need tiny, albeit representative, samples to find the most relevant patterns. Sometimes, a single crucial observation or data point is sufficient to deliver the key insight. And — more often than you may be willing to admit — all you may need is gut feel, instinct, or intuition to crack the code of some intractable problem. New data may be redundant at best, or a distraction at worst, when you’re trying to collect your thoughts.
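The sampling point is easy to demonstrate. Below is a minimal sketch, using an entirely made-up population of a million “customers” with a hypothetical 30 percent churn rate (the scenario and numbers are illustrative assumptions, not anything from this post), showing that a simple random sample of a thousand records estimates the population-level rate about as well as scanning all million:

```python
import random

random.seed(42)

# Hypothetical population: 1,000,000 customers, ~30% of whom churn.
population = [1 if random.random() < 0.30 else 0 for _ in range(1_000_000)]

# A tiny simple random sample is often enough to estimate the rate.
sample = random.sample(population, 1_000)

pop_rate = sum(population) / len(population)
sample_rate = sum(sample) / len(sample)

print(f"population churn rate: {pop_rate:.3f}")
print(f"sample churn rate:     {sample_rate:.3f}")  # close, from 0.1% of the data
```

The standard error of a proportion shrinks with the square root of the sample size, so for many questions the marginal value of the next million rows is nearly nil.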

Science is, after all, a creative process where practical imagination can make all the difference. As data scientists push deeper into Big Data territory, they need to keep from drowning in too much useless intelligence. As this dude said recently, keep your big data pile compact and consumable, to facilitate more agile exploration of this never-ending, ever-growing gusher.


Validity and Reliability still a challenge

I have many thoughts regarding the future of "data science" and "big data" in general, and most of them are positive. However, I have a caveat. Increasingly, I get the impression that some believe "big data" can be mined with greater ease than other forms of data, and with less regard for scientific rigor. Said another way, some believe that modern data mining somehow relaxes the necessity for validity and reliability testing en route. I was recently exposed to some employee productivity data at a large public corporation, and was horrified by the generalizations the data analysis staff had reached. Essentially, someone came up with some ideas to test for correlations in their data, and then built those correlations into dashboards that portrayed them as causal. Honestly, the way the data was presented on the dashboard, the decision-maker would not and could not "see" the spurious correlations throughout. Nevertheless, the dashboard is now in active use at this company, regardless of the lack of scientific rigor applied to its validity and reliability issues. Yes, "big data" and "data scientists" have real potential to make decision-making more informed. However, when "big data" degenerates into journalistic levels of validity and reliability, the potential to do harm is great. Everyone in the data science business, including statisticians, operations engineers, and so forth, has a hand in where we go from here. Let's hope that standards of reliability and validity remain integral to "data science" along the way...

I second everything you just said


Yes, I see that same disturbing trend as well. By sanctifying anything "Big Data" with the honorific "Data Science," there's the implicit assumption that whatever analysis you produce is, by virtue of that and no matter how shoddy your methodology or model, "scientific," hence "true" or "valid" or "genius." Correlations are not causations, as anybody with even a smattering of Statistics 101 should know. Check out my other blogs on this topic from these past 5 days for my fuller perspective on all this.
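To make the correlation-versus-causation point concrete, here is a minimal, self-contained sketch using the textbook "ice cream and drownings" scenario with fabricated seasonal numbers (an illustrative assumption, not data from any dashboard discussed here). Two series correlate strongly only because both are driven by a shared third factor, temperature:

```python
import math
import random

random.seed(0)

# Fabricated monthly data: a shared driver (temperature) moves both series.
months = range(36)
temp = [20 + 10 * math.sin(2 * math.pi * m / 12) for m in months]
ice_cream_sales = [5 * t + random.gauss(0, 5) for t in temp]
drownings = [0.3 * t + random.gauss(0, 1) for t in temp]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Strong correlation, yet neither series causes the other.
r = pearson(ice_cream_sales, drownings)
print(f"correlation: {r:.2f}")
```

A dashboard plotting these two series side by side would look damning; only knowledge of the confounder reveals the correlation as spurious.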



I'm the editor of AOL Government and am interested in reprinting your blog. If you're open to that, please reply to (Sorry for contacting you here. Didn't find a ready email link on your page. Thanks.

Have forwarded your request


I've forwarded your request to Phil LeClare (, who handles all such reprinting/syndication requests from media. He'll get back to you promptly.


Primacy of science in data science

I agree that the integrity of scientific methods comes first, and how much data comes second. I also agree that a wealth of data provides opportunity for some analyses that would not otherwise be possible. We have direct experience with both long-tail analysis and microsegmentation, both of which require an order-of-magnitude increase in data volume.

I don't see any tension between trying to handle big data and trying to enforce rigorous methodology. Both have become inescapable. Big data is the newest additional pressure on the work of analytics, so it gets a lot of current attention.

What is truly new, to me at least, is working with the fuzzy nature of unstructured data and the touchy-feely nature of insights derived from it. Analytics against unstructured data is truly new to most businesses. It may not be a surprise to skilled data scientists, but in my experience, database administrators, software architects, and BI analysts are frequently unfamiliar with unstructured data. Whenever big data efforts bring along unstructured data, the business teams will face a learning curve.

There's no inherent tension between data scale and scientific rigor


Agreed. The tension is between the scale of the scientific endeavor (the number of participants, hypotheses, tests, and so on) and scientific rigor. Clearly, the issue is one of process controls and governance: ensuring that a larger-scale team continues to adhere to a methodology that guarantees the replicability and defensibility of any findings.

As regards "unstructured data," that's not new to most behavioral scientists: it's what any human emits all of the time: language. A big part of the challenge is rendering language and other unstructured behavior patterns in structured formats that enable measurement, correlation, and pattern-discovery.
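A toy illustration of that structuring step, using made-up forum posts and a plain bag-of-words count (a deliberately simple assumption for illustration, not a recommendation of any particular toolkit):

```python
from collections import Counter

# Hypothetical forum posts: unstructured text we want to measure.
posts = [
    "big data needs big rigor",
    "rigor first data volume second",
    "more data is not more insight",
]

def bag_of_words(text):
    """One simple structuring step: tokenize and count terms."""
    return Counter(text.lower().split())

vectors = [bag_of_words(p) for p in posts]

# A shared vocabulary turns each post into a comparable numeric vector,
# ready for measurement, correlation, and pattern discovery.
vocab = sorted({term for v in vectors for term in v})
matrix = [[v.get(term, 0) for term in vocab] for v in vectors]

print(vocab)
print(matrix)
```

Once language is in this kind of structured form, the usual statistical machinery, and the usual validity questions, apply to it just as they do to any other measurement.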

Scientists must adapt their metrics to the phenomena being studied. The world is "polystructured," and scientists study the world. Nothing new, really.



"Sometimes, a single crucial observation or data point is sufficient to deliver the key insight."
An example would be nice of a case where you have solved a truly complex problem with one data point!
I understand you are trying to make your point, but I cannot accept statements at face value.
The value of data and data-driven modeling is being able to understand phenomena at an abstract level.
That said, having a bunch of data and throwing a random algorithm at it doesn't help.