So, I just got back from Forrester’s Customer Experience Forum in New York. This year, it was at the Marriott Marquis, right in the heart of Times Square. Now, if you’re like me and have lived in a rural (ok, backwoods) town for the past 10 years, Times Square can be pretty overwhelming. You feel like you’re wading through a sea of people with every step. You hear more languages and see more diverse cultures in a block than in an around-the-world trip. And the neon and pictures and street-hawkers and . . . and . . . and. It’s total information overload.
Even worse, I had arranged to meet clients in the middle of this chaos. I was lost and running late. The call was short but clear: “Can you hear us? We’re here. Where are you? We need to leave soon.”
For many market insights professionals, my experience in Times Square is a microcosm of reality. You've been stuck in the back office, already struggling to meet your present stakeholders' needs. Suddenly, you're thrust into an overwhelming sea of new data sources with an executive mandate to find the customers and figure out their needs. Worse, if you don't do this quickly, your customers are going to leave.
What's clear is that Hadoop has already established its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and embedded execution of advanced analytics. As noted in a recent blog post, this is in fact the dominant use case for which Hadoop has been deployed in production environments.
Yes, traditional (Hadoop-less) EDWs can in fact address this specific use case reasonably well — from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it's just a matter of time — one to two years, tops — before all EDW vendors bring Hadoop into the heart of their architectures. For those EDW vendors who haven't yet committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.
Where the next-generation EDW is concerned, the petabyte staging cloud is merely Hadoop's initial footprint. Enterprises are moving rapidly toward the EDW as the hub for all advanced analytics. Forrester strongly expects vendors to incorporate the core Hadoop technologies — especially MapReduce, the Hadoop Distributed File System, Hive, and Pig — into their architectures. Again, the impressive growth of MapReduce as a lingua franca for predictive modeling, data mining, and content analytics will practically compel EDW vendors to optimize their platforms for MapReduce, alongside high-performance support for SAS, SPSS, R, and other statistical modeling languages and formats. We see clear signs that this is already happening, as with EMC Greenplum's recent announcement of a Hadoop product family and indications from some of that company's competitors that they have similar near-term road maps.
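MapReduce's appeal as a lingua franca is that any analytic that can be phrased as a map step (emit key/value pairs) and a reduce step (collapse each key's values) parallelizes naturally across a cluster. Here's a minimal in-memory sketch of that programming model — the function names and sample documents are illustrative, not any actual Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (key, value) pairs -- here, (word, 1) per occurrence."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values -- here, summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["Hadoop scales out", "Hadoop runs MapReduce"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle_phase(pairs))
# counts["hadoop"] == 2
```

In a real Hadoop cluster, the map and reduce functions run on many nodes at once and the shuffle happens across the network, which is what lets the same three-phase pattern scale to petabytes.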
Problems don’t care how you solve them. The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal.
When people speak of “Big Data,” they’re referring to problems that can best be addressed by amassing massive data sets and using advanced analytics to produce “Eureka!” moments. The issue of what approach — Hadoop cloud, enterprise data warehouse (EDW), or otherwise — gets us to those moments is secondary.
It’s no accident that Big Data mania has also stimulated a vogue in “data scientists.” Many of the core applications of Hadoop are scientific problems in linguistics, medicine, astronomy, genetics, psychology, physics, chemistry, mathematics, and artificial intelligence. In fact, Yahoo’s scientists not only had a predominant role in developing Hadoop but — as exploratory problem-solvers — they are active participants in Yahoo’s efforts to evolve Hadoop into an even more powerful scientific cloud platform.
The problems that are best suited to Hadoop and other Big Data platforms are scientific in nature. What they have in common is a need for analytical platforms and tools that can rapidly scale out to the petabyte level and support the following core features:

- Petabyte scale-out
- In-database advanced analytics
- Mixed-workload support
- Cloud-based deployment
- Complex data sources
Enterprises have options. One of the questions I asked the firms I interviewed as Hadoop case studies for my upcoming Forrester report is whether they considered using the tried and true approach of a petabyte-scale enterprise data warehouse (EDW). It’s not a stretch, unless you are a Hadoop bigot and have willfully ignored the commercial platforms that already offer shared-nothing massively parallel processing for in-database advanced analytics and high-performance data management. If you need to brush up, check out my recent Forrester Wave™ for EDW platforms.
Many of the case study companies did in fact consider an EDW like those from Teradata and Oracle. But they chose to build out their Big Data initiatives on Hadoop for many good reasons. Most of those are the same reasons any user adopts any open-source platform: By using Apache Hadoop, they could avoid paying expensive software licenses; give themselves the flexibility to modify source code to meet their evolving needs; and avail themselves of leading-edge innovations coming from the worldwide Hadoop community.
But the basic fact is that Hadoop is not a radically new approach to processing extremely scalable data analytics. A high-end EDW can do most of what Hadoop does, delivering the core features — including petabyte scale-out, in-database analytics, mixed-workload support, cloud-based deployment, and complex data sources — that characterize most real-world Hadoop deployments. And the open-source Apache Hadoop code base, by its devotees' own admission, still lacks such critical features as the real-time integration and robust high availability you find in EDWs everywhere.
Most Hadoop-related inquiries from Forrester clients come to me. These have moved well beyond the “What exactly is Hadoop?” phase to the stage where the dominant query is “Which vendors offer robust Hadoop solutions?”
What I tell Forrester clients is that, yes, Hadoop is real, but that it’s still quite immature. On the “real” side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the “immature” side, Hadoop is not ready for broader deployment in enterprise data analytics environments until the following things happen:
More enterprise data warehousing (EDW) vendors adopt Hadoop. Of the vendors in my recent Forrester Wave™ for EDW platforms, only IBM and EMC Greenplum have incorporated Hadoop into the core of their solution portfolios. Other leading EDW vendors interface with Hadoop only partially and only at arm's length. We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still three to five years from fruition. Most EDW vendors will probably embrace Hadoop more fully in the coming year, with strategic acquisitions the likely route.
Early implementers converge on a core Hadoop stack. The companies I’ve interviewed as case studies indicate that the only common element in Hadoop deployments is the use of MapReduce as the modeling abstraction layer. We can’t say Hadoop is ready-to-serve soup until we all agree to swirl some common ingredients into the bubbling broth of every deployment. And the industry should clarify the reference framework within which new Hadoop specs are developed.