If Queries Are King In The Realm Of The Analytic Database, Does That Make Columnar Heir Apparent?

What’s the perfect database management system (DBMS) architecture, where analytics is concerned? It seems as if everybody in business intelligence (BI), data warehousing (DW), and related areas has their own opinion on this topic. In fact, it’s more than just opinions. In the database wars, we have huge communities of vendors and users with substantial investments in one or more approaches—from traditional relational DBMSs to column-oriented, in-memory, dimensional, inverted indexing, and other approaches. 

In the analytic database wars, new architectures are springing up everywhere, each with its own devotees and differentiators, and each with a go-to-market message that takes potshots at established approaches.  The “No SQL” movement is just the latest coalition to emerge in this titanic struggle, and is essentially a loose coalition of diverse approaches rallying around a common theme: that traditional RDBMSs and their SQL-based access languages are unfit, or at least, maladapted to the new world of cloud, social networking, and Web-oriented analytics applications. 

DW industry analysts such as myself are of course embedded in these wars—we’re the neutral observers often caught in the crossfire. Of course, this is not a new battle. The columnar database industry has been around for many years and positioned itself as the chief high-performance, low-footprint analytics contender to  traditional row-based RDBMSs. And columnar has steadily encroached on relational’s turf, not only through established columnar-based DW vendors such as Sybase, Vertica, and ParAccel, but also through recent adoption (albeit in limited fashion) by the likes of Oracle, SAP, and others into their analytics architectures. 

Query performance is king in the analytics wars, and columnar has important advantages, especially when used to support structured, repeatable scan and queries against very large aggregated tables. Query speed advantages have also driven adoption of dimensional (i.e., denormalized relational) databases for online analytical processing (OLAP). It’s also what’s driving the push to in-memory cache-based data persistence and solid-state drives, which ramp up the input-output operations per second by an order of magnitude. 

So, the million-dollar question (actually, the stakes are far higher) is: Will columnar, or any of the other contenders, eventually drive relational/row-based databases to extinction in the realm of BI, DW, and advanced analytics? Or, if the outlook isn’t that extreme, will columnar and other approaches doom relational databases to a greatly reduced role in analytics.  My educated hunch on the first question is “no,” but a full-throated “yes” on the second. 

Relational databases will be with us for a long time to come.No one’s expecting non-relational DBMSs to gain any significant traction in online transaction processing (OLTP), where it serves as the basis for most mission-critical business applications. Even if there were some critical flaw to relational where transaction processing is concerned, year upon year of massive corporate investment in the technology, tools, and applications would argue against any hasty migration. Where analytics is concerned, let’s not forget that many of the business world’s reports and queries are against these transaction-oriented relational stores. Most of the world’s business data is stored in row-based repositories and most applications have been written to access and manipulate this data through SQL and other standard approaches. 

But even in the analytics-optimized world of stand-alone BI and DW solutions, columnar won’t completely push relational databases over the parapet and into the moat. Let’s not forget that most of the world’s deployed DWs are built on relational technology, and vendors such as Teradata, Oracle, IBM, Microsoft, Netezza, Greenplum, Aster Data, and others continue  to scale up their platforms quite well  through parallel processing, indexes, partitioning, materialized views, query predicate pushdown, hardware acceleration, and other approaches. Nevertheless, we are already seeing traditional RDBMS vendors evolve into hybrid architectures that add columnar, in-memory, and other approaches. Oracle’s recent addition of hybrid columnar compression to its Exadata DW architecture is a noteworthy example, and SAP’s adoption of in-memory columnar in its Business Warehouse Accelerator. 

Let’s not mince words though: relational’s deployment roles aren’t growing. The familiarity, maturity, and standardization of relational technology are its strengths, but also its chief weaknesses, specifically as related to DW and BI. In a world where unstructured and semi-structured information is growing in importance, relational databases seem as if they are pushing their limits. Even in their traditional strongholds—such as DW and BI—relational databases are increasingly becoming marginalized to specific roles, such as master data management hubs. All the while, newer, more agile species—such as dimensional, columnar, inverted index, in-memory, and even flat file databases—slowly but surely come to the surface. 

What explains this somewhat gloomy outlook are the following traditional weaknesses of relational databases: 

  • Need for constant tuning. One of the chief arguments for row-oriented relational’s challenges is that it is fundamentally poorly suited for BI applications. In fact, most general-purpose enterprise relational databases are, where BI are concerned, neither fish nor fowl. They support OLTP but have also been tuned, to varying degrees, for reporting, query, and other read-intensive applications. That’s fine, but most of these databases were initially built and are still primarily used for OLTP--and it shows. Third-normal form and row-based storage remains suboptimal for many complex, multi-join queries (to a point that some BI tools can’t even support such structures). Yes, relational databases can be tuned for BI, but often require specific optimizations for each new application, query, and workload. This stubborn reality has driven the growth of both dimensional databases and the columnar database market.
  • Storage inefficiency. Another strike against OLTP relational databases is that their row-based storage is not storage efficient—or, at least, not as efficient as, say, column-oriented and inverted indexing approaches (though, for obvious reasons, more efficient than denormalized dimensional data structures). Rows often have mixed content and therefore are hardrer to compress, then columns which often have very similar content. Considering that storage costs are often the largest portion of data warehousing and BI capital expenditures, this is definitely a serious shortcoming. To their credit, relational database vendor continue to improve their compression efficiencies, but some might regard that as making the best of an otherwise mismatched technology.
  • Orientation around structured data. Yet another problem is that relational databases are ill-adapted for the new world of unstructured information types, such as those sourced from content management systems and the brave new world of blogs, social networks, complex events, sensor data, and streaming media. Yes, of course, many relational database vendors provide content extraction, transformation, and loading tools that can convert various and sundry non-relational formats—such as Extensible Markup Language—for storage in row-based repositories. But this often feels like a “square peg in a round hole” solution.

It’s clear that analytics-optimized databases are the wave of the future, and that the role of transactions-optimized databases for BI will be circumscribed more narrowly as time wears on. So what of the challengers to relational? How are they likely to fare in this new order of information management? Here are some Forrester quick takes on their respective roles: 

  • Columnar relational databases. Products from vendors like Sybase, Vertica, and ParAccel, are best for fast, complex, predictable queries against very large, structured recordsets persisted on servers. They compress well and require much less tuning then their OLTP cousins. Furthermore, frequent changes to a data model (such as dropping or adding columns) to adapt to ever changing business requirements is a synch with columnar. Relational is better, though, when there is a requirement for single-row lookups or flexible ad-hoc queries or joins across disparate tables. If your BI apps also have many single row lookups or applications that have “SELECT *”, columnar DBMS may not provide significant advantages.
  • In-memory data structures. Products incorporated into solutions from QlikView and TIBCO Spotfire—are best for flexible, ad-hoc, client-centric query, exploration, and visualization of relationships across disparate data sources and types without having to build a data model upfront, so called “post discovery” approach.. Relational is better, though, when there is need for fixed data models to support production reporting and queries. And, for obvious reasons, we still can’t load hundreds of terabytes into client random-access memory, so for now in-memory tools are best suited for departmental, work group or other specific use case data marts.
  • Inverted-indexing data structures. Products incorporated in solutions from Attivio, Endeca, and Microsoft FAST Search—are best for searches across one or more repositories of structured, semi-structured, and unstructured information types where there is a need for flexible metadata and hierarchies to support complex indexing, search, and classification. You should consider relational databases only when these diverse information types can be converted to familiar tabular structures, associated metadata, and query interfaces without violating their underlying semantics. However, bear in mind that inverted indexing is one of the most storage-efficient information-structuring approaches available, which may neutralize any advantages of force-fitting this information into a traditional database. In fact, Forrester believes that inverted-indexing makes sense as the underlying database for the transformation, cleansing, and staging layer in enterprise DW (EDW) environments that consolidate myriad information from diverse sources. However, consider that these technologies do not come without a cost. On inverted-indexing information stores, large enterprise administration features--such as backup/restore, security, and index rebuilding (without taking the application down)--are not as mature and robustness as on OLTP-optimized databases.

There’s a contrarian view too. None of these non-traditional databases are optimal for all roles in a traditional EDW ecosystem, such as handling mixed transactional/analytic workloads. So while Forrester sees a steady trickling of these new technologies in departmental and specific use-case scenarios, these are mostly installed and used in coexistence with, not as outright replacements for, your EDW. 

Clearly, there are ongoing roles for relational databases, as well as for their denormalized cousins—star and snowflake OLAP cubes--but they must be justified for each new application. Likewise, you should do the same—tailor the deployment to your specific requirements—for the non-traditional databases we’ve discussed in this post. Keep in mind that no one data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. 

For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow these constraints and demand more flexible decision support. By requiring that relational data be denormalized and prejoined into star schemas and other fixed, subject-specific structures, traditional multidimensional OLAP denies users the flexibility to drill down, up, and across data sets in ways that were not designed into the underlying cubes. 

The leading non-traditional database approaches--such as columnar, in-memory, and inverted indexing—may be more appropriate for the new generation of advanced analytics. However, just like their relational grand-daddy, no one of these upstarts is a one-size-fits-all solution. None are inherently optimized for all analytic applications, or transactions for that matter. 

In the big sprawling empire of your data management environment, each of these fiefdoms has its proper sphere. Columnar, in-memory, and dimensional make great sense in the front-end, as the enabler for fast query processing in data marts. Row-based relational is often optimal for multi-subject hub-and-spoke EDWs with complex, conformed hierarchies. Inverted indexing and file-based storage are ideal for storing, processing, and manipulating the complex information sets of the Web 2.0 world. 

What do you think? We at Forrester would love to hear your thoughts as we push inexorably into this new frontier.

Comments

Columnar is relational

Hi Jim,

Thanks for your analysis. Your post is insightful, as always. Thanks for the opportunity to share my perspective...

Firstly, you say "And columnar has steadily encroached on relational’s turf," and make other references to say that row-wise equates to relational (which implices columnar doesn't). Later you do mention "columnar relational databases", but mostly you equate row-wise to relational throughout your post. The way I see it, the relational nature of a database is within the logical design and the columnar nature of a database is down at the storage layer (and invisible). so therefore, columnar is indeed relational; the logical design, DDL and SQL layers are the same as with row-wise. (In fact, we've seen companies load their operational or third-normal-form schemas into our columnar db and run ad hoc, join-intensive analytic queries natively, without any redesign).

You also say "columnar has important advantages, especially when used to support structured, repeatable scan and queries against very large aggregated tables." Perhaps a nuance, but columnar's scan advantage isn't just for aggregated tables, in fact, columnar is often a replacement for aggregated tables because the scanning is so fast you don't need them (or the negatives that come with them).

Thanks for pointing out that table joining is an important capability. But, when you say that "Third-normal form and row-based storage remains suboptimal for many complex, multi-join queries (to a point that some BI tools can’t even support such structures)", we have seen that third-normal form isn't itself the the problem, it's typically due to row-orientation (and its inherent tuning) as well as insufficiency of the query optimizer.

Regarding your statment "Relational is better, though, when there is a requirement for single-row lookups or flexible ad-hoc queries or joins across disparate tables," my thought is there is some "apples vs oranges" going on here. The first part about single-row lookups is true, but the second part about ad hoc or joins across disparate tables isn't necessarily. Ad hoc implies that the query is unforeseen and, if so, can't be tuned for in advance. If the nature of the ad hoc query happens to take advantage of pre-existing tuning elements in a row-wise DB, it can possibly perform well, but it may perform poorly if there are untuned elements needed to answer the query. As for joins across disparate tables, this, again, is a function of query optimizer strength (and availability of pre-tuning elements such as materialized views). Our experience is that the more ad hoc and analytically complex a query is, the less advantage a row-wise database has.

Finally, please expand on "Row-based relational is often optimal for multi-subject hub-and-spoke EDWs with complex, conformed hierarchies." It seems to me that if the EDW is a true hub, then the multi-subject nature is irrelevent for querying because it's passive and the querying goes on in the surrounding analytic data marts. Conversely, if the EDW is intended for cross-domain querying, then, per your note, it would require tuning structures for proficient analytic querying. I agree with you that EDWs most often are row-wise, but I believe it is because row-wise was the predominant database orientation when many EDWs were being built, not because it has any inherent advantage with multi-subject or complex, conformed hierarchies.

Again, thanks for the opportunity to share my perspective and I look forward to your reply. I always enjoy your provocative discussion on analytic DB technology, and I hope my comments provide a catalyst for more.

Relational in this context is a shorthand term for .....

Kim:

Thanks for the great, detailed response.

I was using the term "relational" in this blogpost per common usage: referring to databases built on the modeling, persistence, and storage of data in tabular, row-based, third-normal-form. In other words, to RDBMSs as they are commonly understood. But a broader, looser definition of "relational," one could also include dimensional/cube and columnar. But that wasn't the point of the blogpost. It was to discuss how the need for speedier, more efficient execution of more complex queries is driving evolution and adoption of analytic databases, and how columnar and other non-RDBMS approaches are benefiting from that requirement. It was also to point out that all analytic database architectures have their pros/cons, that no single approach is likely to dominate, and that RDBMSs (row-based, 3NF) continue to evolve/optimize to rise to this challenge.

Jim