What Do BI Vendors Mean When They Say They Integrate With Hadoop?

There's certainly a lot of hype out there about big data. As I previously wrote, some of it is indeed hype, but there are still many legitimate big data use cases - I saw a great example during my last business trip. Hadoop plays a key role in the big data revolution, so all business intelligence (BI) vendors are jumping on the bandwagon and claiming that they integrate with Hadoop. But what does that really mean? First of all, Hadoop is not a single entity; it's a conglomeration of multiple projects, each addressing a certain niche within the Hadoop ecosystem: data access, data integration, DBMS, system management, reporting, analytics, data exploration, and much, much more. To lift the veil of hype, I recommend that you ask your BI vendors the following questions:

  1. Which specific Hadoop projects do you integrate with (HDFS, Hive, HBase, Pig, Sqoop, and many others)?
  2. Do you work with the community edition software or with commercial distributions from MapR, EMC/Greenplum, Hortonworks, or Cloudera? Have these vendors certified your Hadoop implementations?
  3. Do you have tools or utilities to help the client get data into Hadoop in the first place (see comment from Birst)?
  4. Are you querying Hadoop data directly from your BI tools (reports, dashboards) or are you ingesting Hadoop data into your own DBMS? If the latter:
    1. Are you selecting Hadoop result sets using Hive?
    2. Are you ingesting Hadoop data using Sqoop?
    3. Is your ETL generating and pushing down Map Reduce jobs to Hadoop? Are you generating Pig scripts?
  5. Are you querying Hadoop data via SQL?
    1. If yes, who provides the relational structures? Hive?
    2. If Hive, who translates your SQL into HiveQL?
    3. Who provides transactional controls like multiphase commits and others?
  6. Are you leveraging Hadoop HCatalog for Hadoop queries?
  7. Do you need Hive to provide relational structures, or can you query HDFS data directly?
  8. Are you querying Hadoop data via MDX? If yes, please let me know what tools you are using, as I am not aware of any.
  9. Can you access NoSQL Hadoop data? Which NoSQL DBMS - HBase, Cassandra? Since your queries are mostly based on SQL or MDX, how do you access these key-value stores? If you do, please let me know what use cases you have for BI on NoSQL; I am aware of some search applications based on this, but no real analytics.
  10. Do you have a capability to explore HDFS data without a data model? We call this discovery or exploration.
  11. As Hadoop MapReduce jobs are running, who provides job control? Do you integrate with Oozie, Ambari, Chukwa, or ZooKeeper?
  12. Can you join Hadoop data with other relational or multidimensional data in federated queries? Is it a pass-through federation? Or do you persist the results? Where? In memory? In Hadoop? In your own server?
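To make questions 4 and 5 concrete, here is a minimal sketch of the two integration paths they distinguish: querying Hadoop directly by generating HiveQL, versus bulk-ingesting data with Sqoop. All table names, connection strings, and paths below are hypothetical placeholders, not vendor specifics.

```python
# Two common BI-to-Hadoop integration paths from the checklist above.
# All table names, hosts, and paths are hypothetical placeholders.

def hive_direct_query(table, columns, predicate):
    """Path 1: the BI tool queries Hadoop directly by generating HiveQL;
    Hive compiles the statement into MapReduce jobs behind the scenes."""
    return "SELECT {} FROM {} WHERE {};".format(", ".join(columns), table, predicate)

def sqoop_import_command(jdbc_url, table, target_dir, mappers=4):
    """Path 2: data is moved in bulk with Sqoop, which runs parallel
    MapReduce mappers to copy a relational table into HDFS."""
    return ("sqoop import --connect {} --table {} "
            "--target-dir {} --num-mappers {}").format(jdbc_url, table, target_dir, mappers)

print(hive_direct_query("web_logs", ["user_id", "url"], "dt = '2012-07-01'"))
print(sqoop_import_command("jdbc:mysql://dbhost/sales", "orders", "/data/orders"))
```

The key difference a buyer should probe: in the first path, query latency depends on Hadoop's batch job execution; in the second, freshness depends on the ingest schedule.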

As you can see, you really need to peel back a few layers of the onion before you can confirm that your BI vendor REALLY integrates with Hadoop.

Curious to hear from our readers if I missed anything.


Good questions... but what is the background?

The questions are good. Looking at it from a management perspective, I would have appreciated some context for why to ask the specific questions. E.g., what Hadoop distribution are you offering? Is this a cost driver? Are there architectural limitations in one distro or the other? Security concerns?

Hi, Alex, thanks for the post. These are all Hadoop and Hadoop commercial-distribution questions, not BI/Hadoop integration questions, and the latter is the only point I am trying to address here.

Hadoop keeps track of where the data resides; multiple copies of each block are stored, with the data distributed across the servers in the cluster.
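A toy model of that bookkeeping: each block of a file gets several copies on distinct nodes, and a central catalog (the namenode, in HDFS terms) records where they live. The node names and the replication factor of 3 below are illustrative, not taken from any real cluster.

```python
# Toy model of HDFS-style block replication: every block is written to
# several distinct nodes, and a namenode-like catalog keeps the
# block-to-node map. Node names and replication factor are illustrative.

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(blocks)
    }

layout = place_blocks(["blk_1", "blk_2"], ["node-a", "node-b", "node-c", "node-d"])
# Losing any single node still leaves two copies of every block.
print(layout)
```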

Very, very good checklist!

I have a small comment on (7):
"(7) Can you access NoSQL Hadoop data? Which NoSQL DBMS? HBase, Casandra? Since your queries are mostly based on SQL or MDX, how do you access these key value stores? If yes, please let me know what use cases you have for BI using NoSQL, as I am not aware of any. "

This is not exactly correct - setting aside that you misspelled Cassandra, and the fact that Cassandra is mostly targeting CFS:

1. HBase is used for OLAP; if you had attended the last HBaseCon, you would have seen Adobe and their use cases.
2. There are a few companies (still in stealth mode) that have introduced this capability; I saw a demo. It is promising.

In general I agree - these stores are not in the OLAP space, but they can solve some OLAP use cases.

Thanks very much for the comments. I would love to see those NoSQL use cases as they go public - or, if you'd like, share some info with me under NDA. What's CFS?

CFS = Cassandra File System http://www.datastax.com/dev/blog/cassandra-file-system-design

Here is the Adobe use case for OLAP; they do not have an ODBC/JDBC or SQL interface: http://www.hbasecon.com/sessions/low-latency-olap-with-hbase/

As for the rest... let's wait for HadoopWorld for the details ;-)

mdx for hadoop stack


is believed to have this functionality - I have yet to vet this, but it may be worthwhile to check.

Does anyone have a comparison matrix?

Boris -

This is a great topic that has been glossed over for too long. There is so much hype around connecting BI tools to Hadoop, without any depth on the performance these connections will provide, the functionality they lack, or whether the BI tool is simply ingesting data from Hadoop and bringing it into a cache (and what scale limitations may be associated with that).

Have you seen anyone create a list/matrix comparing the various BI toolsets, how they connect (e.g., via Hive), what functionality limitations there may be for machine-generated SQL, etc.?

It would also be interesting to see how BI vendors are looking at capabilities like Hadoop Catalog (HCatalog) to provide better access to metadata about how/where data is stored across various Hadoop projects.

(shameless plug warning) - Teradata Aster announced new connectivity to Hadoop data via Aster SQL-H ( http://www.asterdata.com/sqlh/ ), which leverages the metadata library of HCatalog to provide views and on-the-fly access to Hadoop data as if it's just another table to the SQL analyst or BI tool. We do this WITHOUT going through Hive translation, which we think is a cleaner approach.

That aside, I think a more in-depth analysis would be a great topic for a research report, if it doesn't already exist, with specific guidance on the pros/cons of different techniques.

Thanks - Steve

Good point about HCatalog - I missed that one and have now added it as an extra question to the blog. No, I am not aware of anyone doing that kind of point-by-point comparison. I plan to, but can't commit to a specific time.

What about integration with Hadoop's HCatalog

We at Teradata Aster integrate with HCatalog. That is a project that customers will need to consider.

Learn more: www.asterdata.com/sqlh

Urgent help required - steps to connect Hive 0.8 to SAP BI 4.0

Could an expert provide me with the steps to connect Hadoop Hive 0.8 to SAP BI 4.0? Your invaluable information would be of great help to me.

Getting data in

Boris, this is a great list. One thing I might add, however, is the collection issue. Many big data problems are new, and the collection or instrumentation problem is every bit as challenging as the processing and consumption problems. So I would ask: how does data get ingested into these solutions? Do you have to build a giant infrastructure to capture data and feed Hadoop? Can you trickle data in, or is it a batch process? Does it require any additional tools or implementation? Can it be fed by a simple RESTful API? Can it be fed directly from source applications, item by item, or do you have to create a large batch infrastructure to load? I think the most successful big data solutions to date have been the ones that either implicitly solved the data collection issue or where that issue was already largely solved.

Just some thoughts...
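The trickle-versus-batch distinction raised above can be sketched as a micro-batching collector: records arrive one at a time (say, from a RESTful endpoint or a source application) but are flushed to storage in batches, roughly what Flume-style collectors do. The sink below is just an in-memory list standing in for an HDFS writer; the class and batch size are illustrative, not from any real product.

```python
# Sketch of the trickle-vs-batch ingestion trade-off: records are
# accepted one at a time but written downstream in micro-batches.
# The sink is an in-memory list standing in for an HDFS writer.

class MicroBatchIngester:
    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink          # stands in for an HDFS/warehouse writer
        self.buffer = []

    def ingest(self, record):
        """Accept a single record (trickle feed); flush when a batch fills."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered records downstream as one batch."""
        if self.buffer:
            self.sink.append(list(self.buffer))
            self.buffer.clear()

hdfs_stub = []
ingester = MicroBatchIngester(batch_size=3, sink=hdfs_stub)
for event in ["click", "view", "click", "purchase"]:
    ingester.ingest(event)
ingester.flush()
print(hdfs_stub)  # [['click', 'view', 'click'], ['purchase']]
```

The design question for a vendor is who owns this layer: the source application, a collection tier they ship, or a batch loader the customer must build.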

Good point - adding a question to the list: how do you help your client get data into Hadoop files in the first place...