Posted by James Kobielus on June 9, 2011
We’ve all been through this many times before. So when will it be?
The same thing happens every time. Some shiny new thing gets built up until it’s too big for its britches and then we delight in shooting it down. Or taking it down a few notches until, chastened, it accepts its less-than-lofty position in the divine order of all things IT.
Hadoop is no fad, but it is definitely getting set up for a sober reappraisal — possibly by this time next year, or as soon as a significant number of major EDW vendors roll out their Hadoop products and strategies. I’ve already painted in broad brushstrokes the milestones that Hadoop needs to pass to be considered truly ready for enterprise prime time. I’m reasonably confident that it will meet those challenges over the next two to three years. I’m even willing to meet the open-source absolutists halfway on their faith that the Apache community will be guided by some invisible hand toward a single market-making distro with universal interoperability, peace, love, and understanding.
But even if Hadoop stays on track toward maturation, we’re likely to see the inevitable backlash emerge, spurred by the widespread impatience that usually follows overweening hype. The snarkfest will come as analytics pros start to realize that, promising as this new approach may be, there are plenty of non-Hadoop EDWs that can address the core petabyte-scale use cases I laid out. Many IT practitioners will ask why they should pay good money for a new way of doing things, with all the concomitant disruptions and glitches, when they can simply repurpose their investments in platforms from vendors such as Teradata, Oracle, IBM, and Microsoft.
The broader backlash will be against “Big Data” as a paradigm. At times, it almost feels like people discuss Big Data with the assumption that bigger is necessarily better and that throwing more data at your problems will automatically produce insights. I hope business and IT professionals heed my advice about searching for those special problems, often of a scientific nature, that can be solved best through petabyte-scale analytics. You don’t need a data center full of maxed-out storage arrays to derive powerful insights. Gut feel is free, and it often thrives on the scantiest information.
The immovable object that Big Data will need to overcome is the limited IT budget. Until petabytes become dirt cheap, few companies can justify the hardware necessary for storing, processing, and managing all this data. The best way for Hadoop specifically, and Big Data generally, to avoid the beancounter’s axe will be, ironically, by staying as small as practical. As IT pros bring Big Data into their core EDW strategies, they will apply every storage-optimization approach in their arsenals — columnar, data deduplication, compression, multitemperature, partitioning, filtering, archiving, purging, etc. — to keep the data tsunami in check.
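To make the budget point concrete, here is a minimal Python sketch (with made-up, illustrative data) showing how two of the techniques named above — deduplication and compression — can shrink a highly repetitive dataset before it ever hits the storage array:

```python
import zlib

# Hypothetical example: a batch of repetitive log records, the kind of
# data that balloons raw storage but compresses and deduplicates well.
records = (["2011-06-09 GET /index.html 200"] * 1000 +
           ["2011-06-09 GET /about.html 200"] * 1000)

raw = "\n".join(records).encode("utf-8")

# Deduplication: store each distinct record once (a real system would
# also keep reference counts or pointers back to the originals).
unique = set(records)

# Compression: a generic zlib pass over the raw byte stream.
compressed = zlib.compress(raw)

print(f"raw bytes:        {len(raw)}")
print(f"distinct records: {len(unique)} of {len(records)}")
print(f"compressed bytes: {len(compressed)}")
```

Real-world data is rarely this redundant, but the sketch illustrates the principle: the cheapest petabyte is the one you never have to store.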
It won’t be pretty, but it will be essential for Big Data to avoid becoming a big budget buster.