Big Data, Brewer, And A Couple Of Webinars

Whenever I think about big data, I can't help but think of beer – I have Dr. Eric Brewer to thank for that. Let me explain.

I've been doing a lot of big data inquiries and advisory consulting recently. For the most part, folks are just trying to figure out what it is. As I said in a previous post, the name is a misnomer – it is not just about big volume. In my upcoming report for CIOs, Expand Your Digital Horizon With Big Data, Boris Evelson and I present a definition of big data:

Big data: techniques and technologies that make handling data at extreme scale economical.

You may be less than impressed with this overly simplistic definition, but there is more than meets the eye. In the figure, Boris and I illustrate the four V's of extreme scale: volume, velocity, variety, and variability.

The point of this graphic is that if you just have high volume or high velocity, big data may not be appropriate. As the characteristics accumulate, however, big data becomes attractive on cost grounds. The two main drivers are volume and velocity, while variety and variability shift the curve. In other words, big data makes handling extreme scale more economical; more economical means more people do it, which leads to more solutions, and so on.

So what does this have to do with beer? I've given my four V's spiel to lots of people, but a few aren't satisfied, so I've been resorting to the CAP Theorem, which Dr. Brewer presented at a conference back in 2000. I'll let you read the link for the details, but the theorem (later proven formally by researchers at MIT) goes something like this:

For highly scalable distributed systems, you can only have two of the following three: 1) consistency, 2) high availability, and 3) partition tolerance. C-A-P.

Translating the nerd-speak: as systems scale, you eventually need to go distributed and parallel, and that requires tradeoffs. If you want perfect availability and consistency, the system's components and network must never fail (partition). If you want to scale out on commodity hardware that does occasionally fail, you have to give up perfect data consistency. How does this explain big data? Big data solutions tend to trade off consistency for the other two. That doesn't mean they are never consistent; it means updates take time to propagate through the system, so consistency is eventual. This makes typical data warehouse appliances, even if they are petascale and parallel, NOT big data solutions. Make sense?
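
To make that tradeoff concrete, here is a rough sketch in Python. It is purely illustrative and not modeled on any particular product: writes are acknowledged as soon as one replica accepts them (availability), the other replicas catch up in the background, and a read against a lagging replica can briefly return stale data (eventual consistency).

    import threading
    import time

    class EventuallyConsistentStore:
        """Toy key-value store that stays available by replicating asynchronously."""

        def __init__(self, replicas=3, replication_delay=0.5):
            self.replicas = [dict() for _ in range(replicas)]
            self.replication_delay = replication_delay

        def write(self, key, value):
            # Accept the write on replica 0 and acknowledge immediately (A).
            self.replicas[0][key] = value
            # Propagate to the other replicas in the background (C arrives later).
            threading.Timer(self.replication_delay, self._replicate,
                            args=(key, value)).start()

        def _replicate(self, key, value):
            for replica in self.replicas[1:]:
                replica[key] = value

        def read(self, key, replica=1):
            # A read routed to a lagging replica can return an old (or missing) value.
            return self.replicas[replica].get(key)

    store = EventuallyConsistentStore()
    store.write("page_views", 100)
    print(store.read("page_views"))  # likely None: replication has not caught up yet
    time.sleep(1)
    print(store.read("page_views"))  # 100: the replicas have converged

A consistency-first design would instead block the write (or fail the read) until every replica acknowledged; this one answers immediately and lets the replicas converge, which is the bargain commodity-scale systems make.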

What are big data solutions? We are giving a couple of webinars on the topic to help you get answers.

These will feature material from my recent research, Expand Your Digital Horizon With Big Data, as well as from Big Opportunities In Big Data and many recent inquiries. 

Hope to speak with you there. Now, thanks, Dr. B...I need a brew.


Comments

beer?

So you should buy me a beer sometime ;)

One thing I tend to think about in this space is that fast approximate answers are much better than perfect answers. The open question is the best way to bound the variance of your answers (literally) given eventual consistency.

Hi Eric...thanks for jumping in

Absolutely agree. My colleague Boris often states that with big data, 2+2 = 3.9 and that's good enough. It's a different mindset. I'll buy you that beer anytime!

Yes, Brian, 2+2=4 to a CFO

Yes, Brian, 2+2=4 to a CFO even if it takes weeks to process and arrive at that answer, and even if the process costs millions of dollars. CFOs have no other choice; 2+2 HAS TO = 4. But if I am a CMO and I wake up in the morning to a new competitive threat, I need to address it right now. I can't wait for weeks. In that case, 2+2=3.9 is good enough. So in business intelligence, analytics, and especially in big data, a single version of the truth is relative and contextual.

Check out this dissenting opinion

http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/

I will say this: it's way too early in the game to disrespect anybody's opinion, which is why I'm posting Mr. Monash's here. He could be right, but let me clear up a few errors in his post.

I didn't say that big data storage somehow violates CAP. I said that big data solutions trade off the C for the AP. That's the whole point: you can't have all three.

I totally don't understand his point about reducing the characteristics of big data down to one. Does anybody?

He could be right about big data also including MPP DW, but he misinterpreted my statement. What I said was that typical DW appliances are not big data. Why? Because they are just throwing more power at traditional relational databases by doing some tricks with MPP. It's still business as usual.

Mr. Monash attempts to minimize the term big data, to destroy it as it were. I embrace the term as a way to promote good dialogue about an important advancement in our ability to do more with more data. Plus, our clients need to understand the terms they hear in order to make crucial decisions. Simply destroying terms, with no replacement lexicon for important new concepts, is counterproductive.

Lots of misreading going on -- or not

Brian,

I don't think I misread you about the CAP Theorem -- I just disagree. The term "Big Data" has been around for a while, and it pretty much started by referring to what can be and is done in MPP relational data warehouses (appliance-based or otherwise).

If you had merely said that "big data" now includes a lot of other stuff AS WELL, that would have been fine. But you're suggesting that what was originally meant by "big data" no longer is or should be called "big data". If that's what you think, then it's time to retire the term altogether. Giving a term a new meaning AND denying the old meaning that people still use just leads to massive confusion, confusion that I and numerous other analysts are already tired of cleaning up.

So Hbase (AFAIK) gives up P

So HBase (AFAIK) gives up P for CA. Does that mean it is not part of "Big Data (TM)"? What happens to an M-R job if all replicas of a particular file (or portion thereof) go down? Does the job finish, or does it wait or error out? Unless it finishes (and this is not something I know the answer to), Hadoop M-R is not AP either. IMHO (very gingerly, given Eric Brewer has commented here), CAP is probably the wrong framework for big data as it applies to analysis (or EDW-like systems). There is a lot of work (much of it in the streaming space) on approximate query answering, which is a better framework for this space, at least in the direction you seem to be going.

Hmmm....

I'll look into approximate query answering ... Tx. Re HBase giving up P for CA, I need to take another look at the architecture; that didn't jump out at me when I reviewed it. I'll ask Eric Baldeschwieler and Brewer to weigh in. Re your M-R question, I'm pretty sure it fails out, but CAP is theory; if you're really worried, you set replication to 5. I don't think this makes Hadoop not big data, just that there are real-world limits, right?

Let me rephrase what I was

Let me rephrase what I was thinking w.r.t. the HDFS question. Consider doing sum() over some value stored in HDFS. In theory, if a shard containing some rows is unavailable (i.e., all replicas are down), Hadoop could return the sum() over the remaining rows. That seems to be a system that satisfies 'A' over 'C'. I say seems because, frankly, what many of us intuitively think of as 'C, A, and P' don't exactly correspond to what the Lynch paper proved. Heck, Mike Stonebraker seems to have got it wrong (see http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partiti...).
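
To make the scenario concrete, here is a toy sketch (my own illustration, not a description of how Hadoop actually behaves; the shard names and values are made up). Skipping the unreachable shard keeps the query available at the cost of a complete, consistent answer; refusing to answer does the opposite.

    # Toy illustration only: summing a value across shards when one shard's
    # replicas are all down. Shard layout and numbers are made up.
    shards = {
        "shard-0": [12, 7, 31],
        "shard-1": [5, 44, 9],
        "shard-2": None,  # every replica of this shard is unreachable
    }

    def available_sum(shards):
        """Favor availability: return a partial sum and report what was skipped."""
        total, skipped = 0, []
        for name, rows in shards.items():
            if rows is None:
                skipped.append(name)
            else:
                total += sum(rows)
        return total, skipped

    def consistent_sum(shards):
        """Favor consistency: refuse to answer unless every shard is reachable."""
        if any(rows is None for rows in shards.values()):
            raise RuntimeError("shard unavailable; not returning a partial answer")
        return sum(sum(rows) for rows in shards.values())

    print(available_sum(shards))  # (108, ['shard-2']): approximate but available
    try:
        print(consistent_sum(shards))
    except RuntimeError as err:
        print("consistent_sum refused:", err)  # consistent but unavailable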

Your suggestion of increasing the replication factor is useful for reads, but all that does is reduce the chance of 'P' during a read (while increasing the chance of 'P' during a write), right? Frankly, every EDW system has something like this: it's not as if a node going down in a TD cluster means new queries stop returning; a replica can handle the load. How is this different, in the CAP domain, from what Hadoop does? DW systems that support dirty reads also give up 'C'. Does that make them more suitable for big data?

I would suggest that giving up C in analytical systems has to be done carefully. This is what I suspect Eric Brewer's comment about 'bounding variance' is talking about. This is very different from Cassandra etc., where you are usually doing single-row accesses (writes/reads etc.) and the app can view the conflicts on that single row and *potentially* do something sensible, at least in some applications. In an analytics system, the client deals with a huge number of rows aggregated into a much smaller set. Making sense of inconsistencies over millions of rows is much harder: the hope is that they cancel each other out, but in theory they could actually add up, making your aggregated answer useless.
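
A rough numerical illustration of that last point (my own, with made-up numbers): if the per-row staleness is unbiased, the errors largely cancel in a large aggregate, but if the staleness is systematic, the error grows in proportion to the row count.

    import random

    random.seed(42)
    true_rows = [100.0] * 1_000_000        # the "correct" per-row values
    true_total = sum(true_rows)

    # Unbiased inconsistency: stale values are equally likely to be high or low.
    unbiased_total = sum(v + random.uniform(-1, 1) for v in true_rows)

    # Biased inconsistency: 10% of rows are each missing a recent +1 update.
    biased_total = sum(v - 1 if random.random() < 0.10 else v for v in true_rows)

    print(f"true total:     {true_total:,.0f}")
    print(f"unbiased error: {abs(unbiased_total - true_total):,.0f}")  # ~sqrt(N): negligible vs 100,000,000
    print(f"biased error:   {abs(biased_total - true_total):,.0f}")    # ~linear in N: about 100,000 here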

So big data is about making CAP tradeoffs

So, I think the point in all this is that big data forces some CAP tradeoffs. Perhaps not always C for AP, however. Whether MPP EDW appliances are big data or not, the jury is still out, but some strong points are made in this thread.

My point: when you have to start prioritizing Cs, As, and Ps, you've got a big data problem.

PS - I meant to say Hbase

PS - I meant to say HBase chooses CP over AP (not CA over AP).

3V's are from a 10-year-old Gartner/Meta publication

Just to clarify the genesis of the 3V's concept that others have laid claim to: it comes from a 2001 Gartner (then Meta Group) note I published entitled "3-D Data Management Challenges." Happy to send anyone a copy. A few years later, my colleague Mark Beyer extended the concept into a comprehensive 12-dimensional model for "extreme data" factors. -Doug Laney, VP Research, Gartner

Thanks...glad to have that cleared up.

We love having everyone contribute to these discussions, including you and other Gartner analysts; it raises the knowledge level for everyone. However, if you read the discussion, it's not even about whether there are 4 V's, or who 'invented' them. It's about the CAP Theorem and what it means for the types of answers firms should look to big data for.