Posted by James Kobielus on April 1, 2009
In a recent article, Bill Inmon incinerates a strawman concept that he refers to as “virtual data warehousing (DW).” For those unfamiliar with Inmon, he is generally considered the founder of DW as a data management discipline, has been at it since the 1970s, and has more published books and articles to his name than most mortals. So he can clearly be considered an authority on the topic of DW.
But methinks Mr. Inmon doth protest too much on this “virtual DW” bugaboo, however defined (we’ll get to that in a moment). Also, he attacks this concocted notion with such emotional vehemence that it’s clear he considers it a threat to the centralized EDW paradigm upon which he has built his career and reputation.
For starters, his definition of this concept is oddly vague and questionably narrow: “a virtual data warehouse occurs when a query runs around to a lot of databases and does a distributed query.” Essentially, Inmon defines “virtual DW” as the ability to a) farm out a query to be serviced in parallel by two or more distributed databases, b) aggregate and join results from those databases, and c) deliver a unified result set to the requester.
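To make that three-step pattern concrete, here is a minimal Python sketch of a distributed query. It is an illustration under stated assumptions, not anything taken from Inmon's article or from a particular EII product: the two in-memory SQLite databases stand in for distributed sources, and the `sales` table and the helper names `make_source`, `query_source`, and `federated_totals` are invented for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_source(rows):
    """Build a stand-in source database (hypothetical `sales` table)."""
    # check_same_thread=False lets a worker thread use a connection that
    # was created here; each connection is still used by one thread only.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def query_source(conn):
    """Step (a): the sub-query that each source services on its own."""
    sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return conn.execute(sql).fetchall()


def federated_totals(sources):
    """Fan the query out, merge the partial results, return one result set."""
    # Step (a): farm the query out to all sources in parallel.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = list(pool.map(query_source, sources))

    # Step (b): aggregate the per-source partial sums by region.
    totals = {}
    for partial in partials:
        for region, amount in partial:
            totals[region] = totals.get(region, 0.0) + amount

    # Step (c): deliver a single, unified result set to the requester.
    return sorted(totals.items())


if __name__ == "__main__":
    sources = [
        make_source([("east", 100.0), ("west", 50.0)]),
        make_source([("east", 25.0), ("north", 10.0)]),
    ]
    print(federated_totals(sources))
```

The requester sees one result set and never learns how many databases were consulted, which is the whole of what Inmon's definition covers.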
That’s an important query pattern, but not the only one that should be supported under (pick your quasi-synonym) data federation, data virtualization, or enterprise information integration (EII) architectures. Inmon’s definition excludes the many federated queries that hit only a single database, with no joins or result aggregation, and with the EII fabric handling the necessary on-demand transformation from that source’s schema to an abstract semantic model.
Per my data federation report from last fall, Forrester has a broader perspective on the topic than does Mr. Inmon. Data federation is any on-demand approach that queries information objects from one or more sources; applies various integration functions to the results; maps the results to a source-agnostic semantic-abstraction model; and delivers the results to requesters. Nothing in the scoping of data federation necessarily requires the multi-source aggregation and joining that Inmon puts at the heart of “virtual DW.”
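A small Python sketch may help show why that broader scoping does not depend on joins. It walks the four steps above against a single source. Everything named here is an assumption made up for the illustration: the `CRM_ROWS` source with its cryptic column names, the `Customer` semantic model, and the functions `query_crm`, `cleanse`, `to_semantic`, and `federated_customers`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical source-agnostic semantic model seen by requesters."""
    customer_id: int
    full_name: str
    country: str


# Stand-in for one source system with its own physical schema.
CRM_ROWS = [
    {"CUST_NO": "17", "FNAME": " ada ", "LNAME": "lovelace", "CTRY_CD": "gb"},
    {"CUST_NO": "42", "FNAME": "grace", "LNAME": " hopper", "CTRY_CD": "us"},
]


def query_crm(country_code):
    """Step 1: query information objects from the source, on demand."""
    wanted = country_code.lower()
    return [row for row in CRM_ROWS if row["CTRY_CD"].lower() == wanted]


def cleanse(row):
    """Step 2: apply an integration function (here, whitespace trimming)."""
    return {key: value.strip() for key, value in row.items()}


def to_semantic(row):
    """Step 3: map the source schema to the semantic-abstraction model."""
    return Customer(
        customer_id=int(row["CUST_NO"]),
        full_name=f'{row["FNAME"].title()} {row["LNAME"].title()}',
        country=row["CTRY_CD"].upper(),
    )


def federated_customers(country_code):
    """Step 4: deliver results expressed only in semantic-model terms."""
    return [to_semantic(cleanse(row)) for row in query_crm(country_code)]


if __name__ == "__main__":
    print(federated_customers("GB"))
```

No second database, join, or aggregation appears anywhere in this flow, yet it still counts as federation under the broader definition because the requester works against the abstract model rather than the source schema.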
Putting Inmon’s narrow scoping of “virtual DW” behind us for the moment, let’s consider his chief objections to this approach. First, it requires the “analyst to integrate data” (as if that’s something analysts are ill-suited for or regard as some inordinate burden). Second, it consumes resources, experiences suboptimal performance, and “shuffles a lot of data around the system that otherwise would not need to be moved” (as if centralized DWs don’t consume resources, experience performance bottlenecks, and move data). Third, it is “limited to the [historical] data found in the [source] databases.” Fourth, it suffers from “no reconcilability of data...[hence] no single version of the truth for the corporation.”
It’s a fairly straightforward matter to dispatch these objections:
First, data integration, whether through ETL, EII, or other approaches, is a core job function for DW professionals, not some alien task outside their competency.
Second, data federation is often the optimal approach for low-latency BI (just check out the case studies in my data federation and really urgent analytics reports). Federated environments can be tuned to provide top-notch performance and minimize source-system impacts when “shuffling” data around in a decentralized fabric.
Third, the source databases in a federation environment often include DWs, which, per their core function, usually manage a considerable amount of historical data. Once again, see my data federation report, which discusses case studies for a) Federation of Local DWs via Centralized EII Infrastructure and b) Federation of Dispersed EDW and ODS Data Into Siloed BI Environments.
Fourth, data federation is not totally incompatible with data reconciliation. In fact, federation environments can be architected for single version of the truth, data governance, and master data management. However, it can indeed be tricky to manage data quality in federated environments (see Rob Karel’s coverage of MDM and DQ for a deep dive on that issue).
My basic objection to Inmon’s line of discussion is that he treats data federation as mutually exclusive with the enterprise DW (EDW), when in fact they are highly complementary approaches, not just in theory but in real-world deployments. Yes, data federation can be deployed as an alternative to traditional EDWs, providing direct interactive access to online transaction processing (OLTP) data stores. However, data federation can also coexist with, extend, virtualize, and enrich EDWs, as well as other data-persistence nodes such as operational data stores (ODS) and online analytical processing (OLAP) data marts. The case studies in the cited reports bear that out.
Inmon’s arguments are worth consideration. The centralized EDW model he touts is useful for illuminating some traditional best practices. But by no means can it do justice to the stubbornly heterogeneous, distributed, mixed-latency BI and DW requirements of most enterprises.