No No No: The Curiously Absent Architecture of Postmodern Analytic Databases

Some say architecture is destiny. But it seems as if the world of analytic databases is moving on many fronts in all directions with no particular destination. That, in fact, is probably a good working definition of “postmodern.”

That’s the only reasonable conclusion you can draw from the sprawling mess of a movement known as “NoSQL.” It defines itself by what it is not, which would be easy to do if all it were rebelling against were something well-defined, such as the need to use Structured Query Language (SQL) to access, query, update, and manipulate data, or the need to store and manage that data in third-normal-form relational databases.

But no. The NoSQL movement seems to fancy itself a catchall for all things “next-generation database,” though even there it is not clear what exactly it is rebelling against. Check out this “NoSQL” scoping definition on the movement’s principal website: “Next Generation Databases [are] non-relational, distributed, open-source...horizontally scalable... modern web-scale databases....schema-free, replication support, easy API, eventual consistency, and more.”

Whew! If that non-definition didn’t leave you gasping for breath, the unruly menagerie of “NoSQL” database approaches listed on the website will completely rob you of your oxygen supply. Apparently, this movement includes an unholy host of old and new database approaches, including wide column store/column families, document store, key value/tuple store, eventually consistent key value store, graph databases, object databases, grid database solutions, and XML databases.

And just when you thought they’d closed the lid on this overstuffed steamer trunk, the NoSQL folks, for good measure, jam in not just one miscellaneous segment--“other NoSQL related databases”--but the super-duper-plus-plus-miscellaneous category of “unresolved and uncategorized” databases.

In my book, none of that amounts to an architecture. And it’s not much of a “marketecture” either. I’m inclined to think of it as more of an “anarchi-tecture”: an alluring void where an architecture should be.

I like to think of NoSQL as the Black Rebels Motorcycle Club of the database world. You know: the gang of which Marlon Brando’s character was the leader in the 1953 film “The Wild One.” That’s the flick where Brando’s Johnny Strabler responded to a girl’s question, “What’re you rebelling against, Johnny?”, with the immortal answer: “Whaddya got?”

If NoSQL is truly a rebellion against the database status quo, then its gang needs a more up-to-date roadmap of the landscape across which they’re cruising. It’s odd that they position their approaches as an alternative only to traditional, row-based, third-normal-form, structured relational databases. What they’re missing is any mention of dimensional, columnar, in-memory, and inverted indexing databases (what NoSQL calls “wide column store/column families” is not the same as established columnar databases from Sybase, Vertica, and others). In recent years, these have become the primary alternatives to relational databases in the enterprise arena, and in fact the deployment roles for relational continue to narrow. See my recent blog post for a more detailed discussion of these alternatives.
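To make the parenthetical distinction concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's actual storage format: the keys, column names, and helper functions (`lookup_row`, `sum_column`) are all hypothetical. A wide-column store behaves roughly like a sparse nested map keyed by row, while a columnar database keeps a fixed schema with each column stored contiguously so that analytic scans touch only the columns they need.

```python
# Wide-column (column-family) layout, sketched as a sparse nested map.
# Each row key maps to column families, and rows need not share columns.
wide_column_store = {
    "user:1": {"profile": {"name": "Ann", "city": "Oslo"}},
    "user:2": {"profile": {"name": "Bob"}, "clicks": {"2010-02-01": 3}},
}

# Columnar layout, sketched as one contiguous list per column.
# The schema is fixed; every column has one entry per row.
columnar_table = {
    "name": ["Ann", "Bob"],
    "city": ["Oslo", None],
    "clicks": [0, 3],
}


def lookup_row(store, row_key):
    """Typical wide-column access: fetch everything stored under one key."""
    return store[row_key]


def sum_column(table, column):
    """Typical columnar access: aggregate one column without reading others."""
    return sum(value for value in table[column] if value is not None)


if __name__ == "__main__":
    print(lookup_row(wide_column_store, "user:2"))
    print(sum_column(columnar_table, "clicks"))
```

The first structure is optimized for keyed lookups over sparse, variable rows; the second for scanning and aggregating a few columns across many rows, which is the analytic workload under discussion here.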

But excuse me for picking on the NoSQL community. This issue of unfocused rebellion has bedeviled other segments of the analytics industry in recent years. When Rob Karel and I blogged last year on “Whatever Happened To EII?,” we were describing a movement that one might reasonably call “NoDW,” and which had by that time withered away from lack of any coherent unifying architecture.

Nowadays, most vendors in that space have elevator pitches that focus more on what they don’t do—providing platforms for building and optimizing DWs—than on what they do. As a result, they cruise collectively under no single banner name—not enterprise information integration (EII), not virtual DW, not data federation, not data virtualization, not data abstraction, not data mashup, not information as a service, not Web data services....not anything specific.

In a best practices report in late 2008, I attempted to capture the sprawl of use cases into which these “NoDW” technologies have been deployed. In the real world, they often supplement and extend DWs, rather than replace them outright. The best you can do is point them at any requirement for which two specific DW architectures, centralized or hub-and-spoke, are not optimal.

Feeling dizzy? If you’re a vendor in these markets, trying to define a clear marketecture and differentiation against a field of phantasmic architectural abstractions, I feel for you. If you’re a user trying to harness this hissing field of felines into a high-performance corporate resource, you’d better hold onto your trusty Clydesdales.

Until they mature, no “NoSQL” database is fit to function as your company’s primary analytic workhorse.

Comments

Good but a Bit Sweeping

Great post - thanks.

I'm not so convinced by the sweeping 'no “NoSQL” database is fit to function as your company’s primary analytic workhorse' conclusion. I would not say Hadoop is unproven. If it works for Amazon, Adobe, Adknowledge, AOL, Facebook, Google, IBM, NY Times, Twitter etc. then that's probably good enough for me.

"..but a bit sweeping"?

Steve:

Thanks. I don't understand why that's a bad thing, to be comprehensive in vision. Also, I don't understand the claim you're making for Hadoop. In which of these firms is a Hadoop-based system their core enterprise data warehousing platform?

Jim

Great Post

James

It was great chatting with you the other day with Boris.

Before I comment on your post, some background. I was a kernel engineer at Informix and Redbrick and I am very passionate about this topic. I have close professional relationships with ex-DB industry architects and engineers who are now at social media companies and at companies like Yahoo and Google, where they are involved in analyzing large volumes of data. I am also very familiar with some of the new vendors like AsterData, Paraccel and Vertica.

I agree with your comments about the NoSQL movement. I think the movement is confused and misnamed. However, the problems the internet services are facing remain, and they still need solutions. Companies like Facebook and Yahoo are generating massive amounts of data. Their customers (the advertisers) are demanding ROI proof, and traditional databases are falling woefully short of providing the analytics. Oracle, Sybase and DB2 were not designed for such workloads and fast turnaround of analytics. As an example, I know of a company that is generating 30 TB of web data every day. Just the ETL on it takes over 1 week, assuming the process goes through successfully. It has got to a point where Informatica is falling short and people are using Ab-Initio for ETL.

Different people are approaching the problem differently:
- Some are using Hadoop - Facebook is an example.
- Some are using MPP technologies (AsterData).
- Some are creating proprietary technologies not exposed to the outside world for competitive reasons (Google).
- Some are proposing BigTable - a fundamentally new way to store and retrieve information.
- Some are approaching it with columnar storage and compression (Vertica).
- Some are trying to eliminate disk-based architectures and move to in-memory, given the economics and the fact that disk I/O latency is a non-starter at this scale. FusionIO is making good strides in coming up with such storage.
- Some are using a combination of the above.

The above approaches are in different stages of infancy. To say that any of the above are doomed to failure would be like saying that a 3-year-old will never succeed because they can't read or do algebra. However, technical arguments can certainly be made to predict success and failure.

SQL can be too cumbersome to solve some problems. The parser layers of DBs like Oracle are so complex that the amount of time it takes to parse a SQL statement - let alone execute it - makes a SQL-based engine completely unfit for how Google serves up ads during search. Similarly, row-storage models and typical SQL execution approaches make ad-hoc analysis over 50 TB of data impractical.
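A rough back-of-the-envelope sketch of the row-storage point, in Python. The numbers other than the 50 TB figure are assumptions chosen for illustration (100 equal-width columns, a query touching 5 of them, no compression or indexing), and `bytes_scanned` is a hypothetical helper, not a real engine's cost model.

```python
TB = 10**12  # decimal terabyte, in bytes


def bytes_scanned(table_bytes, total_columns, columns_read, columnar):
    """Estimate bytes read by a full-table analytic query.

    Assumes equal-width columns, no compression, and no indexes.
    A row store reads whole rows; a column store reads only the
    columns the query references.
    """
    if columnar:
        return table_bytes * columns_read // total_columns
    return table_bytes


table = 50 * TB
row_store = bytes_scanned(table, total_columns=100, columns_read=5, columnar=False)
column_store = bytes_scanned(table, total_columns=100, columns_read=5, columnar=True)

if __name__ == "__main__":
    print(f"row store scans    {row_store / TB:.1f} TB")
    print(f"column store scans {column_store / TB:.1f} TB")
```

Under these assumptions the row layout reads all 50 TB while the column layout reads 2.5 TB, before compression widens the gap further.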

I think the problem is characterized by the phrase "BigData". This is the problem that everyone is trying to address. To say that the solution is "NoSQL" however is shortsighted.

Interesting points

James

I agreed with a few bits of what you said, but I'm struggling with how you can be so dismissive in tone of a capability that many many companies are investing time and effort in and using to drive certain parts of their business.

I'm not sure how important your statement that "no “NoSQL” database is fit to function as your company’s primary analytic workhorse" is. If Google, Facebook, Amazon and Twitter are all using NoSQL database technologies to drive large parts of their business, then surely NoSQL is important, because these companies wouldn't be able to do what they do without it, would they?

I agree NoSQL doesn't have the nice "clean" standardisation and definition that exists in the SQL world, but to write a post that is very down on NoSQL is in my view a bit odd.

Regards,

Kass

Those companies use NoSQL DBMSs as their EDW?

Kass:

I'm not aware of any of those companies using a NoSQL database as their internal enterprise data warehouse. If you know of any such case studies, I'd love to get more detail.

Jim

NoSQL

Jim

I don't know how many companies are using NoSQL as their internal enterprise data warehouse, and you could be right that it's zero, but I think that's missing my point a bit. My point is that internal data warehousing is one application for databases; there are lots of other functions that require databases (e.g. search - Google, user information - Facebook, tweets (140 characters) - Twitter), and without NoSQL these guys would've struggled to be in business, let alone use anything for an internal enterprise data warehouse.

Does that make sense?

Kass

Once again, read the blog and what I actually said

I summed up the thesis of the post in the very final sentence: "Until they mature, no “NoSQL” database is fit to function as your company’s primary analytic workhorse."

Clearly, the NoSQL market, to the extent that you can define one with any specificity (which I doubt) is immature. Nevertheless, I obviously appreciate the diversity of innovative database approaches in today's market, which is clear from this excerpt: "If NoSQL is truly a rebellion against the database status quo, then its gang needs a more up-to-date roadmap of the landscape across which they’re cruising. ....What they’re missing is any mention of dimensional, columnar, in-memory, and inverted indexing databases .... In recent years, these have become the primary alternatives to relational databases in the enterprise arena, and in fact the deployment roles for relational continue to narrow."

I'm covering NoSQL (however defined) as part of my focus on the virtualization of databases (old, new, emerging) for advanced analytics. Hadoop (HBase, HDFS, etc.) is important, as are the graph databases, and also the various other approaches. But, as I state in my blog, these diverse approaches have no common architecture, no common (meaningful) theme, no common apps, etc.

"NoSQL" is even more nebulous an industry theme than "cloud" or "social." Until the "NoSQL" community rallies around a cause more meaningful than aversion to SQL or RDBMSs, I'll cover the diverse "NoSQL" segments on their own merits, not for their connection to this spurious catch-all theme.