An Enterprise Data Warehouse Without A Database—Is That Even Conceivable?

 The Twittersphere keeps chirping with definitional disputes about what exactly constitutes an enterprise data warehouse (EDW). This is the sort of debate that we geekier analysts love to engage in, since it gives us a chance to beat our chests and brandish our superior powers of cogitation. 

Since I have exposed skin in this game, I’ll flex my cognitive muscle a little bit more for those who wonder what Forrester’s position is on all this. Given that my update to the Forrester Wave™ for EDW platforms will come out in a month or so, this is probably a good time to level-set the discussion. In this post, I will also point to some trends that are pushing the boundaries of what an EDW is and can do for you. 

Some have argued that an EDW requires a DBMS, but is not, in itself, a DBMS. I’ve heard it said that a DBMS only becomes an EDW when it incorporates a schema and stores data. Still others argue that an EDW is something entirely distinct from a DW (without the “enterprise” modifier), a data mart, or an operational data store (ODS). 

I find all of these perspectives hairsplitting and misleading, in that they blur the actual distinctions among these architectural constructs. Paradoxically and obliquely, though, they all hint at the rapid evolution of the EDW into something more protean and virtualized. 

At heart, all of these concepts point to the foundation of any analytics initiative: an infrastructure for persisting, preparing, and delivering intelligence to downstream applications. Fundamentally, a traditional DW is an analytics-optimized DBMS for storing and managing structured data. Its architecture usually incorporates some variant or blend of relational and/or dimensional logical schemas, as well as columnar and/or row-based physical structure and disk-based and/or in-memory persistence. 

I’m not telling DW professionals anything they don’t already know. However, the debate gets a little more contentious when we define what it means to prepend an “E” onto the concept of a DW. Clearly, the industry’s main schools in this matter were propounded ages ago by Bill Inmon and Ralph Kimball. Rather than define a Forrester school on such things, I’ll just lay out our core, layered EDW definition, which splits the difference between them and sits at the heart of our coverage. 

According to Forrester’s perspective, the minimum “E” requirements for an EDW are: 

  • Provides an analytics-optimized information persistence and delivery layer.
  • Aggregates information into integrated, nonvolatile, time-variant repositories under unified governance.
  • Organizes information into subject-area data marts that correspond with one or more business, process, and/or application domains.
  • Supports flexible deployment topologies such as centralized, hub-and-spoke, federated, independent data marts, and ODSes.
  • Enables unified conformance and governance of detailed, aggregated, and derived information, as well as associated metadata and schemas, by business stakeholders.
  • Extracts, loads, and consolidates information from sources through various approaches.
  • Governs the controlled distribution of information to various downstream repositories, applications, and consumers.
  • Maintains the availability, reliability, scalability, load balancing, mixed workload management, backup and recovery, security, and other robust platform features necessary to meet the most demanding, changing enterprise mix of analytics, data management, and decision support workloads.

Note that I have endeavored to leave the term “data” out of Forrester’s definition of EDW. This is because the EDW is evolving into an “enterprise content warehouse,” for persisting complex information (structured, unstructured, semistructured) from social media, enterprise content management (ECM), and other sources. As I pointed out in a Forrester blog earlier this year, Hadoop is a key technology for next-generation cloud-based EDWs optimized for complex content. 

This points to another key trend in EDW evolution: the continued transformation of these infrastructures away from traditional centralized and hub-and-spoke topologies toward the new worlds of cloud-oriented and federated architectures. The EDW itself is evolving away from a single master “schema” and more toward a semantic abstraction layer and use of distributed in-memory information as a service (IaaS). Under this new paradigm, the next-generation EDW supports virtualized access to the disparate schemas of the relational, dimensional, and other DBMSes and repositories that together constitute a logically unified cloud-based resource. 
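To make the idea of a semantic abstraction layer a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `sales` table, the `social_docs` list standing in for a non-relational store, the `SEMANTIC_MAP` dictionary, and the function names are hypothetical and do not describe any vendor's product. The sketch only shows the general pattern of mapping each repository's physical field names onto shared logical names and merging the results into one view.

```python
import sqlite3

# Hypothetical relational source with its own physical column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_id TEXT, amt REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("c1", 120.0), ("c2", 75.5)])

# Hypothetical non-relational source: plain documents standing in for a NoSQL store.
social_docs = [
    {"customerRef": "c1", "mentions": 14},
    {"customerRef": "c3", "mentions": 2},
]

# Semantic layer (assumed design): logical name -> physical name, per source.
SEMANTIC_MAP = {
    "sales": {"customer": "cust_id", "revenue": "amt"},
    "social": {"customer": "customerRef", "mentions": "mentions"},
}


def read_sales():
    """Read the relational source, renaming columns to their logical names."""
    mapping = SEMANTIC_MAP["sales"]
    cols = ", ".join(f"{phys} AS {logical}" for logical, phys in mapping.items())
    cur = conn.execute(f"SELECT {cols} FROM sales")
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur]


def read_social():
    """Read the document source, renaming fields to their logical names."""
    mapping = SEMANTIC_MAP["social"]
    return [{logical: doc[phys] for logical, phys in mapping.items()}
            for doc in social_docs]


def customer_view():
    """Merge records from both sources on the logical 'customer' key."""
    view = {}
    for record in read_sales() + read_social():
        view.setdefault(record["customer"], {}).update(record)
    return view


if __name__ == "__main__":
    for customer, record in sorted(customer_view().items()):
        print(customer, record)
```

Consumers of `customer_view()` never see `cust_id` or `customerRef`; they work only with the logical names. A real virtualization layer would also have to handle query pushdown, type reconciliation, and governance, none of which this toy example attempts.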

Once again, we need to look to Hadoop as a harbinger of this new order. Its alternative persistence architectures, including the Hadoop Distributed File System (file-based) and HBase (columnar), show that these and other “NoSQL” technologies will be as fundamental to the next-generation EDW as the familiar relational and columnar databases. 

The trend is toward a virtualized enterprise content cloud geared both to the traditional EDW roles supporting BI and operational reporting, and to the new world of advanced analytics for social media analytics, sentiment analysis, and many other compute-intensive functions involving complex content and dynamically shifting mixed workloads. 

This EDW virtualization vision is consistent with what I expressed in a Forrester blog a year and a half ago. It also points to the differentiators, here and now, between the leading players in today’s fast-changing EDW market. And it helps you to understand how some future EDWs may totally lack the traditional underpinning of structured RDBMSes, especially as Hadoop, IaaS, and other approaches supply the virtualized persistence layer. 

In the forthcoming Forrester Wave, don’t be surprised to see criteria that highlight vendor progress in evolving their EDW solution portfolios to meet this future head-on.

Comments

Does the fact that you do not

Does the fact that you do not mention the in-memory approach imply that you think this is still a DBMS? I do, but it mainly addresses speed and agility in data acquisition and information consumption rather than virtualization and deployment options.

I do mention in-memory

twice

My mistake

Of course, you are right... I am not sure how I missed that! Have a great Holiday and sorry if I made your headache worse!

Forrester doesn't get to redefine EDW

The definitions of EDWs, marts, and ODSes have been clear for many years. Forrester does not get to redefine the good work of hundreds of people who struggled to define these terms and use them accurately for 20 years. Messing with something that has years of clarity obfuscates what users buy, lets vendors create noise and hype, lets SIs misrepresent what they do, and so on.

A data warehouse must be subject oriented, nonvolatile and consistent, integrated, time variant, and non-virtual. You are correct that the universally accepted definition does not mandate a DBMS. But it does demand something like a schema, integration of data structures across subject areas, retention of data for long periods without changes, etc. Most of the new wave products like Hadoop and NoSQL cannot handle even one of these requirements.

Also notice that an EDW is not a technology discussion; it's a design pattern.

If you want to envision RDBMSes encompassing more than their own data -- as in Hadoop -- that's OK. But do not confuse it with an EDW. Come up with a new name for it, a new buzzword. The EDW can and does continuously evolve within its definition. Changing the EDW definition allows hype, lies, and confusion to reign. Plus, it needs consensus across the analyst community, not one rogue opinion.

I saw your Twittersphere discussion on this. I believe two experts -- Merv Adrian and Colin White -- rejected your theory. One of them remarked "You're drinking too much jungle juice!"

Enterprise Information Web

What you are describing in this post is what we call the Enterprise Information Web. The EIW enables a new kind of analytic capability we call Emergent Analytics. These capabilities are based completely on W3C-based semantic technologies (RDF, OWL, and SPARQL) but deliver most of the capability you describe in exactly the way you describe it. We are even looking at Hadoop for the persistence layer. You are right on the money on every point.

If the goal of a data

If the goal of a data warehouse is really to provide a way to report on corporate business information, why is the selection of a particular technology, such as a database, the critical requirement?

Just because you've always done it that way, doesn't mean it's the way it should always be done.

There's value in both. It's a matter of understanding what the value propositions are, and why you actually need both!

From an IBM perspective, we see non-relational database reporting capabilities enabling a new paradigm in unstructured business insight. These are complementary, and integrate with traditional BI tools and technologies - it's not a decision point, it's a new opportunity to understand your business better.

Currently our IBM Content Analytics solution is based on Lucene and UIMA. Who knows, but maybe Hadoop is the right next step for the massive scale that enterprise content requires.

http://www-01.ibm.com/software/data/content-management/analytics/

Regards,
Paul O'Hagan
IBM ECM - Offering Manager
pohagan@ca.ibm.com
@paul_pohagan