Posted by James Kobielus on September 4, 2008
Dreams do come true sometimes. Or, at the very least, they may start to feel less like dreams than intuitions that ripened a bit earlier in the dreamer's mind than in the world in which he or she may live.
The dream of a global analytics cloud - aka "data warehousing (DW) in the cloud," "DW 2.0," "DW as a Service" - is continuing to materialize, as evidenced by a steady stream of important industry developments. Perhaps "cloud" is the wrong metaphor, considering that this vision is more of an expanding hypersphere of deep data that, through its massive gravitation, pulls an ever-growing nebula of complex computational challenges into its orbit.
Maybe we should call this uber-DW the "analytics orb" - in other words, the conceptual mothership of the industry's growing focus on "in-database analytics." Under this vision, analytics migrate to the DW platform and leverage its full parallel-processing, partitioning, scalability, and optimization functionality. Why move huge data sets to other platforms to be processed when all that analytical heavy lifting can be done on the most powerful, scalable, and cost-effective platform [appliance, cloud, orb] available - that also happens to be the planet where the data permanently resides?
For I&KM professionals, this vision is starting to become a commercial reality, as implemented by a growing range of DW vendors, both startup and veteran. The most recent industry development in this regard was last week's announcements by DW vendors Greenplum and Aster Data that they have implemented the Google-developed parallel programming model called MapReduce in their respective products. For its part, Google has been using MapReduce in its massive search environment to efficiently process petabytes of data - unstructured, semi-structured, structured - across large clusters of servers; the two DW vendors are now exposing that model alongside their MPP-optimized SQL. One of the key innovations with MapReduce is that it provides a framework for parallelizing any in-database analytical algorithm, not just SQL queries. (Parallelizing the latter is old hat - it's what every vendor of a shared-nothing MPP DW has long provided.)
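To make the MapReduce idea a bit more concrete, here is a minimal single-machine sketch in Python. It is not Google's, Greenplum's, or Aster Data's implementation; the `map_reduce` helper, the sample data, and the function names are all hypothetical, and a thread pool merely stands in for the nodes of a shared-nothing MPP system. The point it illustrates is that the mapper and reducer are arbitrary user-supplied functions, so the same scaffolding can parallelize an analytic computation that is not expressed as a SQL query.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(partitions, mapper, reducer, workers=4):
    """Run mapper over each partition in parallel, group by key, then reduce."""
    def run_map(partition):
        # Each worker emits (key, value) pairs for its own partition only.
        return [pair for record in partition for pair in mapper(record)]

    # The thread pool is a stand-in (assumption) for separate MPP nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(run_map, partitions))

    # Shuffle: gather all values emitted under the same key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce: collapse each key's values to a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


# Hypothetical data: two partitions of (customer, purchase amount) rows.
partitions = [
    [("alice", 20.0), ("bob", 5.0)],
    [("alice", 40.0), ("bob", 15.0), ("carol", 9.0)],
]


def emit_amount(record):
    # Map step: key each purchase amount by customer.
    customer, amount = record
    yield customer, amount


def mean_amount(customer, amounts):
    # Reduce step: average the amounts collected for one customer.
    return sum(amounts) / len(amounts)


average_spend = map_reduce(partitions, emit_amount, mean_amount)
```

Swapping in a different mapper and reducer (say, a scoring function and a top-N selector) changes the analysis without touching the parallel plumbing, which is the property that makes the model attractive for in-database analytics.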
Another important recent development came from DW pure-play Netezza. Several months ago, it acquired predictive analytics tool vendor NuTech, announcing that this firm's technology would help Netezza evolve its DW appliance product family into an extensible platform for customer- and partner-provided analytic applications. Then last week Netezza announced that several partners had rolled out advanced analytics applications designed to leverage the parallel-execution, scalability, and query optimization features in its DW platform.
Oh...and of course DW powerhouses Teradata and Oracle have recently made partner-friendly in-database analytics a key theme in their DW strategies. Netezza certainly isn't the only DW vendor beating that drum. And I'm certainly not the only industry analyst who's been dreaming this dream. Check out the impressive cloud of industry commentary on the MapReduce announcements. There's a lot of low-hanging conceptual fruit in this new paradigm available to be plucked by any prepared mind.
So what's it all mean? Essentially, what all of these developments point to is the inexorable rise of the DW as the scalable, parallel-processing muscle within the new generation of analytics-driven application platforms. What's more, these developments point to the growing role of the DW as a general-purpose information-consolidation point in this new age of Web 2.0, unstructured data, and SaaS.
That said, here are the core DW capabilities in this new paradigm, as near as this analyst's crystal ball will reveal (I'm using "distributed analytic platform" as the catch-all term for this new paradigm):
- Aggregate, process, persist, and deliver any combination of structured, semi-structured, and unstructured information in the distributed analytic platform
- Implement any optimal combination of logical and physical data storage approaches in the distributed analytic platform, including tokenized storage
- Implement any combination of premises-based and SaaS models in the distributed analytic platform
- Transparently virtualize the processing of analytic functions across any heterogeneous combination of operating systems, application servers, and runtime execution containers deployed throughout the distributed analytic platform
- Transparently virtualize the processing of analytic functions across any heterogeneous combination of nodes, CPUs, memory, server hardware, and storage devices throughout the distributed analytic platform
- Efficiently and scalably process any application functions and algorithms - BI, OLAP, ETL, DQ, predictive analytics, text analytics, data scoring, data clustering, etc. - in the distributed analytic platform
- Enable flexible user/partner-driven extension and customization of the distributed analytic platform for in-database analytics, user-defined functions, stored procedures, and logical data models
- Support development of applications on the distributed analytic platform in any standard declarative or functional programming language
- Configure the distributed analytic platform to support any deployment topology, including centralized, hub-and-spoke, decentralized, and mesh, extending over intranet, extranet, and Internet environments
- Configure the distributed analytic platform to support any execution topology including any combination of on-disk and in-memory approaches
- Configure the distributed analytic platform for any storage, parallelism, partitioning, indexing, compression, caching, data reduction, query optimization, workload management, and pushdown optimization approach
Re the MapReduce announcements, what caught my attention was how many elements of my personal dream precipitated from that particular cloud. The smart engineering team at Google has definitely been hard at work. It makes sense that they've taken the lead on this powerful new approach. As a Web 2.0 success icon, Google definitely knows how to dream, and how to deliver. As for the prospects for their new Chrome browser, I'll leave that commentary to others.