NoSQL And Elastic Caching Platforms Are Kissing Cousins

The NoSQL Movement Is Gaining Momentum, But What The Heck Is It?

The NoSQL movement is a combination of an architectural approach for storing data and software products (such as Tokyo Cabinet, CouchDB, Redis) that can store data without using SQL. Thus the term NoSQL.

The idea is pretty simple: Not all applications need a traditional relational database management system (RDBMS) that uses SQL to perform operations on data. Rather, data can be stored and retrieved using a single key. The NoSQL products that store data using keys are called Key-Value stores (aka KV stores).

Because these KV stores are not relational and lack SQL, they may be faster than RDBMSs: they don't have to maintain indexes or relationship constraints, or parse SQL. The downside of NoSQL is that you cannot easily perform queries against related data.
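To make the trade-off concrete, here is a minimal sketch of a KV store, assuming a plain in-memory dictionary stands in for the storage engine. The `KVStore` class and its method names are hypothetical and do not come from any of the products named above. Lookups by key are direct, while a query on the contents of a value has to visit every entry because the store keeps no secondary index.

```python
# Hypothetical, in-memory stand-in for a KV store; not a real product's API.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store treats the value as opaque; it only knows the key.
        self._data[key] = value

    def get(self, key, default=None):
        # A single hash lookup, with no SQL parsing and no join.
        return self._data.get(key, default)

    def scan(self, predicate):
        # Querying by value: with no secondary index, every entry is visited.
        return [k for k, v in self._data.items() if predicate(v)]


store = KVStore()
store.put("user:1", {"name": "Ann", "city": "Boston"})
store.put("user:2", {"name": "Bob", "city": "Austin"})

profile = store.get("user:1")
bostonians = store.scan(lambda v: v["city"] == "Boston")
```

The `scan` method is where the downside shows up: the cost grows with the total number of stored entries, not with the number of matches.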

Bravo To The NoSQL Approach

As an analyst who focuses on helping clients achieve massive scale and blazing fast performance, I will be one of the first ones to endorse this approach for many Web applications because:

  • Scaling is easier. When data is not directly related to any other data you can store it anywhere. That means that you can handle more data by adding additional nodes.
  • The engines are faster. There is less overhead because the KV store does not have to parse SQL or maintain multiple indexes to support relationships. Often a hashing algorithm can be used to retrieve data instead of a more expensive B-tree type algorithm.
  • It is easier to change data structures. Need to add a field? No biggie. Many of these NoSQL products store data as blobs. If your data is stored as XML, you may only need to add an attribute or tag rather than thinking about the impact of adding a field to a table in your database.
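The scaling point can be sketched in a few lines, assuming the simplest possible placement rule: hash the key and take the result modulo the number of nodes. The node names and the `node_for` helper are hypothetical, and real products generally use more sophisticated placement schemes than this.

```python
import hashlib
from collections import Counter


def node_for(key, nodes):
    """Pick a node by hashing the key (simple modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
keys = [f"user:{i}" for i in range(1000)]

# Because no key depends on any other key, each one can live on any node.
placement = Counter(node_for(k, nodes) for k in keys)

# Adding a node spreads the same keys over four machines instead of three.
bigger = nodes + ["node-d"]
placement_after = Counter(node_for(k, bigger) for k in keys)
```

One caveat with this naive rule: changing the node count moves most keys to a different node. Schemes such as consistent hashing exist to limit that movement.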

Many Web applications simply don't need to represent data as a set of related tables. Rather, data can be represented as an object graph or byte stream identified by a single key. For example, a user profile can be represented as an object graph (such as a POJO) with a single key being the user ID. Another example: documents or media files can be stored with a single key, with indexing of metadata handled by a separate search engine.
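As an illustration of the user profile example, here is a minimal sketch that stores a nested profile as a serialized blob under the user ID. It assumes JSON as the serialization format and a plain dictionary as the store. The `save_profile` and `load_profile` helpers and the key format are made up for this example.

```python
import json

store = {}  # stand-in for a KV store: key -> opaque bytes


def save_profile(user_id, profile):
    # The whole object graph is serialized and stored under one key.
    store[f"profile:{user_id}"] = json.dumps(profile).encode("utf-8")


def load_profile(user_id):
    return json.loads(store[f"profile:{user_id}"].decode("utf-8"))


save_profile(42, {
    "name": "Ann",
    "interests": ["running", "databases"],
    "address": {"city": "Boston"},
})

# Adding a field later needs no table change: read, modify, write back.
profile = load_profile(42)
profile["twitter"] = "ann_runs"
save_profile(42, profile)
```

Older profiles that lack the new field remain readable, so the application has to tolerate its absence when loading them.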

Elastic Caching Platforms Are KV Stores On Steroids

Elastic caching platforms such as IBM eXtremeScale, GigaSpaces, Terracotta, Microsoft Velocity, Hazelcast, NCache, and Infinispan are essentially in-memory KV stores that provide most of the benefits of NoSQL KV stores but add the following features:

  • Lower latency. These platforms store data in-memory. This significantly reduces the latency of data operations. In-memory storage is a downside though if you need to persist objects over time or have large objects such as video or documents.
  • Reliability. Distributed caching platforms employ clever data replication algorithms that store the data on multiple nodes. If one of the nodes goes down, the platform will serve the data from a backup node.
  • Scale-out. Most of the elastic caching platforms let you add and remove nodes during operations. The platforms use sophisticated algorithms to re-balance the data to optimize the use of all the nodes in the grid.
  • Code execution. Some, but not all, of the platforms also let developers distribute the execution of code across the grid. Using distributed code execution, developers can distribute the workload to where the data resides rather than moving the data to the application.
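The reliability feature can be sketched with a toy grid, assuming each key is written to a primary node and to one backup node (here simply the next node in the list). The `TinyGrid` class is hypothetical and far simpler than the replication algorithms real platforms use. It only shows why a read still succeeds after one node goes down.

```python
import hashlib


class TinyGrid:
    """Toy data grid: every key lives on a primary node and one backup node."""

    def __init__(self, node_names):
        self.names = list(node_names)
        self.nodes = {name: {} for name in self.names}
        self.down = set()

    def owners(self, key):
        # Primary is chosen by hash; the backup is the next node in the list.
        h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")
        i = h % len(self.names)
        return self.names[i], self.names[(i + 1) % len(self.names)]

    def put(self, key, value):
        for name in self.owners(key):
            if name not in self.down:
                self.nodes[name][key] = value

    def get(self, key):
        # Try the primary first, then fall back to the backup copy.
        for name in self.owners(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

    def fail(self, name):
        self.down.add(name)


grid = TinyGrid(["n1", "n2", "n3"])
grid.put("session:abc", {"cart_items": 3})

primary, backup = grid.owners("session:abc")
grid.fail(primary)               # simulate losing the primary node
value = grid.get("session:abc")  # served from the backup copy
```

This sketch leaves out everything that makes the real problem hard, such as re-creating the lost copy on another node and re-balancing data when nodes join or leave.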

NoSQL Wants To Be Elastic Caching When It Grows Up

Platforms that often get labeled as NoSQL such as Apache Cassandra are closer to elastic caching platforms because they add many of the features of elastic caching technologies. Ultimately, the real difference between NoSQL and elastic caching now may be in-memory versus persistent storage on disk.

Because both are KV stores I predict the following:

  • Elastic caching will offer optimization for persistent data stores. Elastic caching platforms will include better features for customers who want the benefits of reliability through replication and scale-out but do not need the low latency of in-memory stores. Most products can read/write data from databases, but the data needs to go through the in-memory cache first. For example, persistence features would be a good approach for large objects such as media files or documents.
  • Many of the NoSQL platforms will grow up. Platforms generally associated with NoSQL will evolve to gain reliability through replication and automatic scale-out features. Some will just remain superbad KV stores honed for the single purpose of being a single-repository KV store.
  • Query and search across data in the KV stores will be the next big feature. Huh? I thought NoSQL and KV stores were for apps that didn't need much query and search. Wherever there is data stored, someone will want to query or search it. But it is hard on data that is distributed. Try doing a simple aggregate like counting the number of objects that meet certain criteria. It is hard.
  • Code execution is next. As I mentioned above, many of the elastic caching platforms also offer distributed code execution. This lets developers run object code near to where the data is stored. Clever developers can implement map/reduce-like applications to process large workloads without moving data around. Or, they can host services on the nodes.
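The counting aggregate and the map/reduce idea from the predictions above can be sketched together. This sketch assumes each partition is a plain dictionary and uses a thread pool to stand in for code running on separate nodes. The partition data and function names are hypothetical. The counting function runs against each partition where it sits, and only the small partial counts are combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions; in a real grid each would live on a different node.
partitions = [
    {"order:1": {"total": 120}, "order:2": {"total": 30}},
    {"order:3": {"total": 75}},
    {"order:4": {"total": 500}, "order:5": {"total": 10}},
]


def count_matching(partition, predicate):
    # "Map" step: runs next to the data and returns one small number.
    return sum(1 for value in partition.values() if predicate(value))


def distributed_count(partitions, predicate):
    # "Reduce" step: only the partial counts travel back to the caller.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda p: count_matching(p, predicate), partitions)
        return sum(partials)


big_orders = distributed_count(partitions, lambda order: order["total"] >= 100)
```

A count combines easily because partial counts can simply be added. Aggregates that need to see related records from different partitions together are where the difficulty begins.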

Whoa. Did I just define the characteristics of a database: persistent storage, query, and stored procedures (code execution)? Back to the future?

Say "Yes" To Elastic KV Stores In Your Architecture

Enterprise application developers and architects should include elastic KV stores in their architectures because they can:

  • Achieve savings by reducing usurious RDBMS licenses and maintenance.
  • Add a scaling layer in front of databases or other data sources.
  • Improve performance of Web applications that store session and shared application data.
  • Scale apps in the cloud: elastic caching and cloud computing are a match made in heaven.

John Rymer and I plan to publish research on Elastic Caching Platforms (including a Wave) during the first part of Q2 2010. Look for it. If you have a NoSQL or elastic caching success story we would love to hear from you.

Mike

Senior Analyst

Twitter: mgualtieri

Coverage: Blazing-fast and massively scalable Web and application architectures, development, and user experience design

Comments

NoSQL? Is this mainframe redux?

For those of us who are old enough to remember mainframes, there was a time way back in the dimmest reaches of time when relational databases didn't exist. Back then you stored your data as a "blob" (or "record") with a key. You had to know where in the data a "field" (aka "column") was so that you could extract it.

Gee, maybe we weren't so backwards then...

Mike, this means you have to

Mike, this means you have to have at least two different DBMS platforms: one for transaction processing, one for BI. Do you see an issue with that?

Beating Brewer's CAP Theorem

Mike,

Thank you for your summary. We will indeed see rapid convergence of the replication and caching technologies as well as advances to reduce the limitations of simple Key-Value datastores. However, we are still faced with one of the oldest problems in computer science, as neatly explained in Eric Brewer's CAP Theorem. The impossibility of any one distributed database enjoying consistency, availability and partition tolerance together is the root of all the issues we see, and of the compromises you have listed, be it in scale-out, performance, consistency or self-healing.

At risk of self-promotion, may I draw your attention to GenieDB (http://www.geniedb.com/). We are a London-based venture born of a long-established web technology firm. GenieDB is just emerging from under the radar, and is indeed speaking at 'Under the Radar' on April 16th. Our engineering solution to the CAP theorem is achieved by not having just one database, but synchronously layering two databases together, each featuring different 'ideal characteristics': first we have an elastic in-memory sharded consistency layer that delivers immediate consistency, whilst updates are simultaneously sent to a second persistent, fully replicated datastore via a messaging subsystem.

CAP may be impossible, but switching in real time between C+AP certainly isn't, and without breaking the laws of physics! Interoperability is also critical in helping developers adopt next generation technology, so I share your opinion that richer SQL-like functionality must become the norm for NoSQL. We feel passionate, having built sites of every shape and size since the web began, that developers should not have to bear the burden of sudden changes to the way they work. Working around eventual consistency and primitive KV stores requires considerable change from your typical, highly competent LAMP stack developer. So we hope that the new wave of ultra-efficient webscale database products will feature not only technological advances such as GenieDB's but also work hard to co-exist with today's established database platforms. We have demonstrated this by building a 'NoSQL for MySQL' pluggable storage engine featuring intertable joins between GenieDB and InnoDB / InnoDB etc, critical for widespread uptake and migration in development communities.

This might not create world peace, but it'll certainly make the lives of application developers, DBAs, businesses and end users a whole lot better, and improve how efficiently we scale both websites and the Web itself.

Dr Jack Kreindler (Founder, GenieDB)

PS Good luck with your Marathon... They hurt!

Another Wave of Scalable DBMS on the horizon...

Great post Mike.

Just a heads up that there are a bunch of new DBMSs that seem poised to hit the market this year (2010) that address the need for extreme scalability...with SQL and with ACID transactions.

Products include http://www.voltdb.com (led by Postgres, Ingres, Illustra and Vertica inventor Mike Stonebraker; VoltDB is in beta now), and also NimbusDB, Akiba and Basho, off the top of my head.

By year end there should be two categories of ultra-scalable data management solutions that scale horizontally:

KV stores (NoSQL) - like the products you mentioned in your post. Good for apps that:
a) don't need SQL or ACID transactions (OLTP)
b) do need scalability
c) do need support for dynamic schema (i.e., not relational)

Scalable OLTP SQL DBMSs - for workloads that:
a) do need SQL & ACID transactions
b) do need scalability
c) need/prefer relational data schemas

There are probably other ways to organize the choices...but definitely lots of great innovation going on in database management!

-Andy

semi-structured data

The V part of a K,V store can have structure inside it. For instance, it can have name,type,value triples or a reference to one of a small set of data structure descriptors. Or it could be a dynamically typed dictionary or a serialization from any old language or according to any old scheme of choice. Heck, it can even contain the code to unpack or do other operations on itself.

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

One thing that I found personally, being around this area of caching and nosql (before the term was even coined), is that once a nosql solution is chosen, probably one of the next steps is to add search to the mix (think of twitter without search...). This is why I think search should come hand in hand with a nosql solution; for example, ElasticSearch (http://www.elasticsearch.com) is built just for that. The idea is that any change you make in the nosql solution is nearly immediately reflected in a search engine (real time search is all the buzz). For this, the search engine solution needs to be elastic as well in order to cope with the nosql solution.

Cheers,
Shay Banon (founder of ElasticSearch, and others...)

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

Mike

Great overview on the topic.

There is an excellent research paper that was published recently by Stanford University titled "The Case for RAMCloud" (http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf) that provides more in-depth analysis of memory vs disk based solutions:

"Numerous new storage systems have appeared in recent years to address the scalability problems with relational databases; examples include Bigtable [4], Dynamo [8], and PNUTS [6] (see [23] for additional examples). Unfortunately, each of these is specialized in some way, giving up some of the benefits of a traditional database in return for higher performance in certain domains. Furthermore, most of the alternatives are still limited in some way by disk performance. One of the motivations for RAMCloud is to provide a general-purpose storage system that scales far beyond existing systems, so application developers do not have to resort to ad hoc techniques"

The research provides other interesting data points on the difference between memory vs disk based approach:

1. Memory vs Disk cost decreased by 360x over the last 25 years.
2. Even a 1% miss ratio for a DRAM cache costs a factor of 10x in performance
3. RAMClouds are 100-1000x more efficient than disk-based systems and 5-10x more efficient than systems based on flash memory
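
A rough way to see where a figure like the 10x in point 2 can come from is the standard average-access-time formula. The sketch below assumes a disk access costs 1000 times as much as a DRAM access. That ratio is an illustrative assumption for this example, not a number quoted above.

```python
def average_access_time(hit_time, miss_time, miss_ratio):
    """Expected cost of one access, given how often the cache misses."""
    return (1 - miss_ratio) * hit_time + miss_ratio * miss_time


dram = 1.0     # one unit of time per DRAM access
disk = 1000.0  # ASSUMPTION: a disk access costs 1000x a DRAM access

# With a 1% miss ratio, the rare disk accesses dominate the average cost.
slowdown = average_access_time(dram, disk, miss_ratio=0.01) / dram
```

Under that assumption the average access is roughly 11 times slower than pure DRAM, which is in the same range as the factor of 10 cited in point 2.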

In my recent DZone interview on NOSQL principles (http://java.dzone.com/articles/no-sql-alternatives) I provided a few other aspects related to the in-memory vs disk based NoSQL approach that you may find useful.

Anyway, I look forward to your upcoming Wave report!

Cheers
Nati S
CTO GigaSpaces

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

You make a good point about the relationship with caching technologies. Many of those technologies exist because of performance or horizontal scaling limitations of traditional databases.

I think it's worth noting that not all the NoSQL products are pure key/value stores. For example MongoDB and CouchDB both support complex queries and secondary indexes. The two attributes that seem common across the nosql space are (1) no joins and (2) light transactional semantics. Those two things make horizontal scaling easy (and thus will make these solutions fit well with cloud computing).

We are also seeing a lot of developers use these solutions not just because of speed or scale but because of ease of development. The more 'schemaless' nature of some of these products fits extremely well with agile development methodologies. Plus the document-oriented (JSON) stores really eliminate a lot of the object-relational impedance mismatch.

dwight/mongodb

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

I already posted about the powerful clustering and caching algorithms of the Papyrus Platform some time back. I was now surprised to read about combining NoSQL and Elastic Caching.

Papyrus uses both of these concepts on its lowest layer as well (and has since V5 in 2000). But it is also a full application platform with a metadata repository and rule engine on top of the distributed, object-relational database and transaction engine. It also employs a strict security layer and an easy-to-use thick and thin-client GUI frontend.

It does not require enterprise application developers and architects to create applications with the above features as they are embedded in the Papyrus platform peer-to-peer kernel engine. Papyrus provides all the benefits you list without the need to go into highly-technical application development.

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

Adding to Dwight's comment, you also may want to look at NoSQL Graph Databases. Among other NoSQL advantages, some of which you are listing, graph databases are much closer (than any SQL database) to natively representing the kinds of object models that most enterprises think in. This takes a large expense out of the development budget.

Johannes/NetMesh/InfoGrid.org

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

Nice post Mike, the only thing I have issue with is the choice of name; I tried explaining in "140 characters or less" but let me try again.

Yes, Data Grids, like many other elements of infrastructure software, will have to be able to deal with elasticity, but I would argue that elasticity is not a defining feature of this category any more than it will be for Web Servers, App Servers, ESBs, etc. I think elasticity will become an intrinsic feature of infrastructure software - we won't be talking about it in 2 years' time - well, certainly no more than we talk about "high-performance" or "reliable" or "highly-available" today.

- Rich

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

Mike,
Unfortunately I have to disagree with you. First it is NOSQL (NotOnly SQL) not NoSQL. Second, it is not the support of SQL or indices that differentiates the domains but the CAP - Consistency-Availability-Partition aspects.
On the positive side, I agree with you on the latency and reliability issues. And the code execution is an interesting take - the adjacency semantics are interesting.
Finally, please research more before you venture into the report ... and add your ideas on what is there now ...
Cheers

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

Thank you all for your comments. Our Wave research on elastic caching platforms (a 6-month project to deeply evaluate 9 platforms based upon more than 100 criteria) was motivated by increasing interest from Forrester's enterprise IT clients. I suspect that the same reasons clients are interested in elastic caching (scale and/or performance) will get them interested in NOSQL. And that is perfect, because applications must be architected for scale to take advantage of the cloud.

re: NoSQL And Elastic Caching Platforms Are Kissing Cousins

Great topic.

Good points regarding Mongo and Couch in the comments. I have seen a lot of Mongo and Redis in production use lately.

Nice post Mike, but you missed the "brother" in the KV-Elastic cache platforms: Memcached. Many of the above named elastic caches support or plan to support the Memcached protocol (e.g. Infinispan, GigaSpaces) because Memcached has become so ubiquitous in the web space and there are tons of Memcached clients for apps/languages (e.g. Ruby, PHP, etc). The Memcached server is in widespread use in data centers like facebook.com's, and there are now elastic Memcached services running in the cloud (gear6 runs one on EC2).