I forget: what's in-memory?

By Boris Evelson
 

In-memory analytics is all the buzz, for multiple reasons. Speed of querying, reporting, and analysis is just one. Flexibility, agility, and rapid prototyping are another. But while there are many more reasons, not all in-memory approaches are created equal. Let's look at the five options buyers have today:
 

1. In-memory OLAP. A classic MOLAP cube loaded entirely in memory (see the sketch after the pros and cons below).

Vendors: IBM Cognos TM1, Actuate BIRT
Pros

  • Fast reporting, querying, and analysis, since the entire model and data are all in memory.
  • Ability to write back.
  • Accessible by 3rd party MDX tools (IBM Cognos TM1 specifically)

Cons

  • Requires traditional multidimensional data modeling.
  • Limited to a single physical memory space (theoretical limit of 3 TB, but we haven't seen production implementations of more than 300 GB; this applies to the other in-memory solutions as well)
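
To make this option more concrete, here is a deliberately minimal Python sketch of the idea (my own illustration, not TM1's or BIRT's actual implementation): every dimension-member combination is a cell held in RAM, a query is just a lookup or a scan over cells, and write-back is a plain assignment. The dimensions, members, and the slice_total helper are all invented for illustration; real products add compression, sparsity handling, rules, and much more.

```python
# A minimal, hypothetical sketch of an in-memory MOLAP-style cube: every
# dimension-member combination is a cell in a dictionary held in RAM, so a
# query is a lookup or scan over cells, and write-back is a plain assignment.
from itertools import product

dimensions = {
    "region":  ["NA", "EMEA", "APAC"],
    "product": ["Widgets", "Gadgets"],
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
}

# The whole cube is materialized in memory, keyed by a tuple of members.
cube = {cell: 0.0 for cell in product(*dimensions.values())}

# Load / write back: assign directly to a cell.
cube[("NA", "Widgets", "Q1")] = 1200.0
cube[("EMEA", "Widgets", "Q1")] = 800.0

def slice_total(**members):
    """Aggregate all cells matching the given dimension members."""
    keys = list(dimensions)
    return sum(
        value for cell, value in cube.items()
        if all(cell[keys.index(k)] == v for k, v in members.items())
    )

print(slice_total(product="Widgets", quarter="Q1"))  # 2000.0
```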

 

2. In-memory ROLAP. ROLAP metadata loaded entirely in memory (see the sketch after the pros and cons below).

Vendors: MicroStrategy
Pros

  • Speeds up reporting, querying and analysis since metadata is all in memory.
  • Not limited by physical memory

Cons

  • Only the metadata, not the entire data model, is in memory (although MicroStrategy can build complete cubes from a subset of data held entirely in memory)
  • Requires traditional multidimensional data modeling.
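
As a contrast to option 1, here is a hypothetical sketch of the ROLAP idea: only the logical-to-physical metadata lives in memory, and each request is translated into SQL that runs against the warehouse. The tables, columns, and generate_sql helper are invented for illustration and are not MicroStrategy's actual metadata model.

```python
# A hypothetical sketch of in-memory ROLAP: only the logical-to-physical
# metadata lives in RAM; the data stays in the warehouse and every request
# is translated into SQL on the fly. Table and column names are invented.
metadata = {
    "attributes": {
        "Region":  {"table": "dim_region",  "column": "region_name",  "key": "region_id"},
        "Product": {"table": "dim_product", "column": "product_name", "key": "product_id"},
    },
    "metrics": {
        "Revenue": {"table": "fact_sales", "expression": "SUM(fact_sales.revenue)"},
    },
}

def generate_sql(attribute: str, metric: str) -> str:
    """Build warehouse SQL from the in-memory metadata."""
    attr = metadata["attributes"][attribute]
    met = metadata["metrics"][metric]
    return (
        f"SELECT {attr['table']}.{attr['column']}, {met['expression']}\n"
        f"FROM {met['table']}\n"
        f"JOIN {attr['table']} USING ({attr['key']})\n"
        f"GROUP BY {attr['table']}.{attr['column']}"
    )

print(generate_sql("Region", "Revenue"))
```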

 

3. In-memory inverted index. An index (with the data) loaded entirely into memory (see the sketch after the pros and cons below).

Vendors: SAP BusinessObjects (BI Accelerator), Endeca

Pros

  • Fast reporting, querying, and analysis, since the entire index is in memory
  • Less modeling required than for an OLAP-based solution

Cons

  • Limited by physical memory
  • Some index modeling still required
  • Reporting and analysis limited to the entity relationships built into the index
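
A simplified sketch of what an in-memory inverted index buys you (an assumed structure, not any vendor's actual implementation): each attribute/value pair maps to the set of record IDs containing it, so a filter becomes a set intersection instead of a scan. All data and field names are invented.

```python
# A simplified sketch of an in-memory inverted index: each (attribute, value)
# pair maps to the set of record IDs that contain it, so a filter is a set
# intersection rather than a table scan.
from collections import defaultdict

records = [
    {"id": 1, "region": "NA",   "product": "Widgets"},
    {"id": 2, "region": "EMEA", "product": "Widgets"},
    {"id": 3, "region": "NA",   "product": "Gadgets"},
]

index = defaultdict(set)
for rec in records:
    for attr, value in rec.items():
        if attr != "id":
            index[(attr, value)].add(rec["id"])

# "region = NA AND product = Widgets" becomes one set intersection.
matches = index[("region", "NA")] & index[("product", "Widgets")]
print(matches)  # {1}
```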

 

4. In-memory associative index. An array/index with every entity/attribute correlated to every other entity/attribute (see the sketch after the pros and cons below).

Vendors: QlikView, TIBCO Spotfire, SAS JMP, Advizor Solutions (also OEMed by Information Builders)

Pros

  • Fast reporting, querying and analysis since the entire index is in memory
  • Less modeling required than for an OLAP-based solution
  • Reporting, querying, and analysis can be done without any model constraints; for example, any attribute can be instantly reused as a fact or as a dimension, and every query with an inner join can also show the results of an outer join on every column.

Cons

  • Limited by physical memory
  • Some scripting/modeling still required to load the data
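
A rough sketch of the associative idea, again my own simplification rather than QlikView's or Spotfire's internals: every field value is linked to the records it appears in, so selecting a value in one field instantly shows which values of every other field remain associated and which are excluded. The data and the associate helper are invented for illustration.

```python
# A rough sketch of the associative idea: every field value is linked to the
# records it appears in, so a selection in one field instantly shows which
# values of every other field are still associated and which are excluded.
from collections import defaultdict

records = [
    {"region": "NA",   "product": "Widgets", "year": 2009},
    {"region": "EMEA", "product": "Widgets", "year": 2010},
    {"region": "NA",   "product": "Gadgets", "year": 2010},
]

links = defaultdict(set)  # (field, value) -> row numbers where it occurs
for row, rec in enumerate(records):
    for field, value in rec.items():
        links[(field, value)].add(row)

def associate(field, value):
    """For every other field, mark each value as associated or excluded."""
    selected_rows = links[(field, value)]
    result = {}
    for (f, v), rows in links.items():
        if f == field:
            continue
        state = "associated" if rows & selected_rows else "excluded"
        result.setdefault(f, {})[v] = state
    return result

print(associate("product", "Gadgets"))
# {'region': {'NA': 'associated', 'EMEA': 'excluded'},
#  'year': {2009: 'excluded', 2010: 'associated'}}
```

Because nothing in that structure privileges one field over another, any attribute can play the role of a dimension or a fact, which is the flexibility called out in the pros above.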

 

5. In-memory spreadsheet. A spreadsheet-like array loaded entirely into memory (see the sketch after the pros and cons below).

Vendors: Microsoft (PowerPivot)

Pros

  • Fast reporting, querying and analysis since the entire spreadsheet is in memory
  • No modeling required
  • Reporting and analysis are as simple as sorting and filtering a spreadsheet

Cons

  • Limited by physical memory
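
And a toy sketch of the in-memory spreadsheet idea: all rows sit in RAM as a flat table, and analysis is just filtering, sorting, and summing, much as in a spreadsheet. Column names and values are invented for illustration; this is not how PowerPivot is implemented internally.

```python
# A toy sketch of the in-memory spreadsheet idea: all rows sit in RAM as a
# flat table, and analysis is just filtering, sorting, and summing, much as
# in a spreadsheet. Column names and values are invented for illustration.
rows = [
    {"region": "NA",   "product": "Widgets", "revenue": 1200.0},
    {"region": "EMEA", "product": "Widgets", "revenue": 800.0},
    {"region": "NA",   "product": "Gadgets", "revenue": 500.0},
]

widgets = [r for r in rows if r["product"] == "Widgets"]  # filter
widgets.sort(key=lambda r: r["revenue"], reverse=True)    # sort
print(sum(r["revenue"] for r in widgets))                 # 2000.0
```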

 

What did I miss, and what did I get wrong? Comments?

Comments

Another for your list

Hi Boris,
You mentioned that your list was complete.
What about Yellowfin?
www.yellowfin.bi

Comment

Yellowfin? Are you serious? That is not even a real BI solution.

some

Very good article that helps to classify the technologies. I would say that inverted indices have proved able to manage large volumes of data, because they allow compression (the same goes for associative indexes) and parallelization (see search technologies).
And there are already live examples of large data volumes with BW Accelerator.

I also feel that inverted or associative indexes enable new, search-like user experiences, such as the ones you get in QlikView, Spotfire, or Polestar.
Lastly, I fully agree when you say that associative technologies are easier to model, which is very important for self-service BI.

In-memory analytics

Hi Boris

I have always found the analytics in BI tools to be very thin. There is very little ability to inspect causality or consequence. There is no doubt that drill down and drill through are powerful, but they don't really get to causality and consequence. They are much earlier in the discovery process, perhaps at the "symptom" level.

Are you aware of any process- or industry-specific tools that can do causality and/or consequence analysis? Perhaps it is just ignorance on my part.

Regards
Trevor

Trevor, there are two ways to

Trevor, there are two ways to look at it: probabilistic and deterministic causality. Probabilistic causality is the realm of statistical analysis, data mining, predictive modeling, and similar tools, which can find a relationship (or not) between one set of entities and attributes and another (SAS, IBM SPSS, etc.). Deterministic causality, where among thousands or millions of entities, attributes, or occurrences only one is related to another, is a much tougher challenge. The real problem is that in order to find that one occurrence you have to have a data model that allows you to examine the relationship between those two elements or occurrences. But since you don't know what you don't know, that is always a problem. Therefore we call traditional analysis techniques that require data models "pre-discovery" vs. analysis that does not require any data modeling "post-discovery" (search for my blog with these keywords). Traditional relational databases don't do this well at all. Typical approaches used for this (still with limited success, but you can get better results than you would from relational models) are in-memory arrays/indexes (QlikTech, TIBCO Spotfire, SAP BI Accelerator, Microsoft PowerPivot), inverted-index DBMSes (Endeca, Attivio), and associative DBMSes (Saffron Tech, illuminate).

Hi Trevor, I would recommend

Hi Trevor,

I would recommend checking System Dynamics modelling tools for deterministic cause-and-effect modelling techniques. Stella and Vensim are the typical tools used with such modelling methodologies. Obviously, BI tools are far behind in such advanced analytical modelling approaches.

Regards,

Mucahit

In-Memory Analytics

Thanks for this, Boris. Being an engineer who ended up studying OR and applied statistics at the graduate level, I get the difference between probabilistic and deterministic causality. Here's a blog I wrote along these lines:
http://blog.kinaxis.com/2009/12/i-am-adamant-that-an-accurate-forecast-d...
There is no doubt that probabilistic causality is the right way to go when a deterministic model is difficult to construct.

Working in supply chain management, I find that "pre-discovery" is crucial for effective performance and risk management. Not only that, but predictive analytics are absolutely paramount, not just the analysis of history. In other words, the fact that a supplier has had 75% on-time delivery over the past 3 months may be very useful to a contracts manager, but it is of very little value to a production manager trying to satisfy customer demand. What the production manager needs to know is that a particular delivery from that supplier is late, but even that is of limited value. What the organization really needs is to be able to predict which customer orders/forecasts will be delayed as a consequence, and which intermediate steps of production and distribution need to be altered as a result. Not only that, the customer will want a revised promise date. This is the causality and consequence that I believe adds real value to an organization.

Don't get me wrong, I think there is tremendous value added by "post-discovery" analytics too, especially in determining market shifts and buying patterns.

Regards
Trevor

What is and what should be

Trevor,

I appreciate your insight on the applicability of getting to the predictive side of things: anticipating the changes in business processes given an anticipated blip in expectations, so we can stay value-add, customer- (and supplier-) focused, and forewarn of changes to the original expectations. This IS where we need to continue to drive the value-add of insights gained from effectively designed predictive models.

The history-centric models, of course, are a foundational basis from which prospective predictive models can be validated. Naturally, the continual review of historical trends is important for "learning from history" and therefore needs to continue to be considered significant (but not sufficient).

Refreshing times

What about refresh times?
I have found that in-memory analytics in PowerPivot is very, very slow.
It has to rebuild "the array" (isn't it a hidden OLAP cube?) every time it is refreshed.
The result is:
30 s for ODBC + OLAP
110 s for PowerPivot
Do you have comparable times for the other applications?
http://afin.net/webcasts/Demo_RefreshData_ADO&AfinNet_vs_Excel2010&Power...
(real time - long)

Good point. Watch out for a

Good point. Watch out for a new blog I am working on with 5-6 criteria to evaluate when comparing in-memory analytics vendors like QlikTech, Spotfire and PowerPivot.

one item I'd add for the SAP

One item I'd add for the SAP BI Accelerator: you still need to do scripting to get the data loaded into the in-memory structures from your source.

Boris, This is an interesting

Boris,

This is an interesting summary. While you are at this nice, simple rubric, I would like you to also add Cloud9 Analytics, PivotLink, Tableau, SSAS + VertiPaq, Oracle Exadata, Oracle OLAP 11g, Oracle Essbase, Vertica, Greenplum, HP Neoview, IBM DB2/DW on the Power7+ platform, Netezza, and Birst to this mix as well. Almost all of them have in-memory caches for query acceleration. Some of them are multidimensional data stores and some of them are relational with SQL extensions.

Lastly, it would be nice if a simple decision tree could be derived from these various pros and cons. I realize that Forrester would not like to take sides, but a nice, well-written decision tree may be valuable to your readers.

Madhukar, I don't consider

Madhukar, I don't consider data caching directly related to the new trend of in-memory analytics. The difference is that data caching basically moves the same data structures from disk to memory, but the access and analysis methods are still designed as if the data were on disk. New in-memory techniques completely redesign the way data is stored, managed, and analyzed. Also, take a look at my latest blog, just posted yesterday, on more decision criteria.

Using Solid State Storage for in-memory

Boris,

Are any vendors using flash boards to support the caching of data from disk? This would allow more data to be analyzed economically.

Yes, some, like Teradata, but

Yes, some, like Teradata, but I do not equate that with in-memory analytics. Caching just moves data from disk to memory, but it still uses memory the same way as disk in terms of data structures, reads, writes, updates, etc. So it's not as efficient or as "next gen" as true in-memory analytical databases. Check with my colleague Jim Kobielus, who covers DW, on which DW vendors are supporting flash.

difference of Caching vs in-memory

I'd be interested to understand the "secret sauce" that fundamentally distinguishes a memory-caching solution from an in-memory database. To a data architect practitioner, it really feels like the same thing.

Mike, caching disk-based DBMS

Mike, caching disk based DBMS data in memory still requires manipulating that data using methods optimized for disk access, and not necesserily the most optimal methods for when all the data is in memory all the time. The front or middle tier program doesn't know whether the data it's accessing is on disk or in memory. Programs for in memory databases are specifically written for optimized in memory data manipulation. Actually, in some cases a program that manipulates the data and the database itself is one and the same (this is true for QlikView, TIBCO Spotfire, AdvizorSolutions)