Posted by James Kobielus on December 21, 2010
The Twittersphere keeps chirping with definitional disputes about what exactly constitutes an enterprise data warehouse (EDW). This is the sort of debate that we geekier analysts love to engage in, since it gives us a chance to beat our chests and brandish our superior powers of cogitation.
Since I have exposed skin in this game, I’ll flex my cognitive muscle a little bit more for those who wonder what Forrester’s position is on all this. Given that my update to the Forrester Wave™ for EDW platforms will come out in a month or so, this is probably a good time to level-set the discussion. In this post, I will also point to some trends that are pushing the boundaries of what an EDW is and can do for you.
Some have argued that an EDW requires a DBMS, but is not, in itself, a DBMS. I’ve heard it said that a DBMS only becomes an EDW when it incorporates a schema and stores data. Still others argue that an EDW is something entirely distinct from a DW (without the “enterprise” modifier), a data mart, or an operational data store (ODS).
I find all of these perspectives hairsplitting and misleading, in that they blur the actual distinctions among these architectural constructs. Paradoxically and obliquely, though, they all hint at the rapid evolution of the EDW into something more protean and virtualized.
At heart, all of these concepts point to the foundation of any analytics initiative: an infrastructure for persisting, preparing, and delivering intelligence to downstream applications. Fundamentally, a traditional DW is an analytics-optimized DBMS for storing and managing structured data. Its architecture usually incorporates some variant or blend of relational and/or dimensional logical schemas, as well as columnar and/or row-based physical structure and disk-based and/or in-memory persistence.
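For readers who want the row-versus-columnar distinction made concrete, here is a minimal Python sketch. The sample records and the `to_columnar` helper are hypothetical illustrations of the idea, not any vendor's storage format.

```python
# Hypothetical sales facts; the field names are illustrative only.
rows = [
    {"region": "east", "product": "widget", "amount": 120.0},
    {"region": "west", "product": "widget", "amount": 80.0},
    {"region": "east", "product": "gadget", "amount": 200.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# Row-based layout: an aggregate has to visit every full record.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: the same aggregate reads only the one column it needs.
total_from_columns = sum(columns["amount"])
```

Both layouts hold the same facts and give the same total. The difference is in how much data an analytic query has to touch, which is why analytics-optimized platforms often favor or blend in columnar structures.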
I’m not telling DW professionals anything they don’t already know. However, the debate gets a little more contentious when we define what it means to prepend an “E” onto the concept of a DW. Clearly, the industry’s main schools in this matter were propounded ages ago by Bill Inmon and Ralph Kimball. Rather than define a Forrester school on such things, I’ll just lay out our core, layered EDW definition, which splits the differences and sits at the heart of our coverage.
According to Forrester’s perspective, the minimum “E” requirements for an EDW are:
- Provides an analytics-optimized information persistence and delivery layer.
- Aggregates information into integrated, nonvolatile, time-variant repositories under unified governance.
- Organizes information into subject-area data marts that correspond with one or more business, process, and/or application domains.
- Supports flexible deployment topologies such as centralized, hub-and-spoke, federated, independent data marts, and ODSes.
- Enables unified conformance and governance of detailed, aggregated, and derived information, as well as associated metadata and schemas, by business stakeholders.
- Extracts, loads, and consolidates information from sources through various approaches.
- Governs the controlled distribution of information to various downstream repositories, applications, and consumers.
- Maintains the availability, reliability, scalability, load balancing, mixed workload management, backup and recovery, security, and other robust platform features necessary to meet the most demanding, changing enterprise mix of analytics, data management, and decision support workloads.
Note that I have endeavored to leave the term “data” out of Forrester’s definition of EDW. This is because the EDW is evolving into an “enterprise content warehouse,” for persisting complex information (structured, unstructured, semistructured) from social media, enterprise content management (ECM), and other sources. As I pointed out in a Forrester blog earlier this year, Hadoop is a key technology for next-generation cloud-based EDWs optimized for complex content.
This points to another key trend in EDW evolution: the continued transformation of these infrastructures away from traditional centralized and hub-and-spoke topologies toward the new worlds of cloud-oriented and federated architectures. The EDW itself is evolving away from a single master “schema” and more toward a semantic abstraction layer and use of distributed in-memory information as a service (IaaS). Under this new paradigm, the next-generation EDW supports virtualized access to the disparate schemas of the relational, dimensional, and other constituent DBMSes and repositories that together form a logically unified cloud-based resource.
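To illustrate what a semantic abstraction layer over disparate schemas might look like, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: the `SemanticLayer` class, the two toy repositories, and their field names are assumptions made for illustration, not a description of any shipping product.

```python
class SemanticLayer:
    """Maps a shared logical vocabulary onto each source's physical schema."""

    def __init__(self):
        self._sources = []  # list of (fetch_rows, field_map) pairs

    def register(self, fetch_rows, field_map):
        # fetch_rows: callable returning an iterable of dicts from one repository
        # field_map: logical field name -> physical field name in that repository
        self._sources.append((fetch_rows, field_map))

    def query(self, *logical_fields):
        """Yield rows from every source, renamed to the logical vocabulary."""
        for fetch_rows, field_map in self._sources:
            for row in fetch_rows():
                yield {name: row[field_map[name]] for name in logical_fields}


# Two hypothetical repositories with different physical schemas.
def relational_store():
    return [{"cust_id": 1, "rev_usd": 120.0}]


def dimensional_mart():
    return [{"customer_key": 2, "revenue": 80.0}]


layer = SemanticLayer()
layer.register(relational_store, {"customer": "cust_id", "revenue": "rev_usd"})
layer.register(dimensional_mart, {"customer": "customer_key", "revenue": "revenue"})

# Consumers ask for logical fields and never see the underlying schemas.
unified = list(layer.query("customer", "revenue"))
```

The consumer asks for "customer" and "revenue" and gets one consistent result set, even though no single master schema exists underneath. A real federated layer would also have to handle query pushdown, type reconciliation, and governance, all of which this sketch leaves out.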
Once again, we need to look to Hadoop as a harbinger of this new order. Its alternative persistence architectures, including the Hadoop Distributed File System (HDFS, file-based) and HBase (columnar), show that these and other “NoSQL” technologies will be as fundamental to the next-generation EDW as the familiar relational and columnar databases.
The trend is toward a virtualized enterprise content cloud geared both to the traditional EDW roles supporting BI and operational reporting, and to the new world of advanced analytics: social media analysis, sentiment analysis, and many other compute-intensive functions involving complex content and dynamically shifting mixed workloads.
This EDW virtualization vision is consistent with what I expressed in a Forrester blog a year and a half ago. It also points to the differentiators, here and now, between the leading players in today’s fast-changing EDW market. And it helps you to understand how some future EDWs may totally lack the traditional underpinning of structured RDBMSes, especially as Hadoop, IaaS, and other approaches supply the virtualized persistence layer.
In the forthcoming Forrester Wave, don’t be surprised to see criteria that highlight vendor progress in evolving their EDW solution portfolios to meet this future head-on.