Big Data: Does It Make Sense To Hope For An Integrated Development Environment, Or Am I Just Whistling In The Wind?

Is big data just more marketecture? Or does the term refer to a set of approaches that are converging toward a common architecture that might evolve into a well-defined data analytics market segment?

That’s a huge question, and I won’t waste your time waving my hands with grandiose speculation. Let me get a bit more specific: When, if ever, will data scientists and others be able to lay their hands on truly integrated tools that speed development of the full range of big data applications on the full range of big data platforms?

Perhaps that question is also a bit overbroad. Here’s even greater specificity: When will one-stop-shop analytics tool vendors emerge to field integrated development environments (IDEs) spanning all or most of the advanced analytics capabilities at the heart of big data?

Of course, that’s not enough. No big data application would be complete without the panoply of data architecture, data integration, data governance, master data management, metadata management, business rules management, business process management, online analytical processing, dashboarding, advanced visualization, and other key infrastructure components. Development and deployment of all of these must also be supported within the nirvana-grade big data IDE I’m envisioning.

And I’d be remiss if I didn’t mention that the über-IDE should work with whatever big data platform — enterprise data warehouse, Hadoop, NoSQL, etc. — you may have now or are likely to adopt. And it should support collaboration, model governance, and automation features that facilitate the work of teams of data scientists, not just individual big data developers.

I think I’ve essentially answered the question in the title of this blog. It doesn’t make a whole lot of sense to hope for this big data IDE to emerge any time soon. The only vendors whose current product portfolios span most of this functional range are SAS Institute, IBM, and Oracle. I haven’t seen any push by any of them to coalesce what they each have into unified big data tools.

It would be great if the big data industry could leverage the Eclipse framework to catalyze evolution toward such an IDE, but nobody has proposed it (that I’m aware of).

I’ll just whistle a hopeful tune till that happens.

Comments

Greenplum Chorus

Hi Jim.
Great post. We call the IDE you described "Greenplum Chorus". While it doesn't support everything on the planet right now, it's a good first step and is gaining momentum in the market.
MwM

Thanks

Mike:

Thanks. Collaboration, model governance, and automation tools for Big Data data-scientist teams: I forgot to list those as important Big Data IDE features in the blog, but they are essential. Clearly, we as an industry still need to clarify how such tools, should they emerge, support teams as well as individual Big Data developers.

I'll tweak the blog right now, while the thought is fresh.

Jim

R language

The R language covers all of those advanced analytics capabilities, and Revolution Analytics makes an IDE for R (http://www.revolutionanalytics.com/products/enterprise-productivity.php). The Big Data angle is key here. Revolution also adds some big-data algorithms to R (for multivariate statistical analysis and predictive modeling) that work with out-of-memory data sets. There are also interfaces to big-data platforms (Hadoop, Netezza, Greenplum, etc.); in some cases, after data processing at the data layer (feature extraction, aggregation, etc.), the resulting data set is amenable to in-memory processing in R, filling out the rest of your grid.
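
To make that pattern concrete, here is a minimal sketch using standard DBI/RPostgreSQL calls rather than Revolution's own interfaces (the connection details, table, and column names are all hypothetical): the heavy aggregation runs down in the database, and only the reduced result comes back into R for in-memory modeling.

    # Push feature extraction/aggregation down to the data layer
    # (Greenplum speaks the PostgreSQL wire protocol), then model in R.
    # Hypothetical host, database, and "orders" table for illustration.
    library(DBI)
    library(RPostgreSQL)

    con <- dbConnect(PostgreSQL(), host = "gp-master", dbname = "analytics",
                     user = "analyst", password = Sys.getenv("GP_PWD"))

    # The GROUP BY executes in the database; only one summary row
    # per customer is returned to the R session.
    features <- dbGetQuery(con, "
      SELECT customer_id,
             COUNT(*)          AS n_orders,
             SUM(order_total)  AS total_spend,
             MAX(order_date) - MIN(order_date) AS tenure_days,
             MAX(churned::int) AS churned
      FROM orders
      GROUP BY customer_id")
    dbDisconnect(con)

    # The aggregated set now fits in memory; fit a predictive model in R.
    model <- glm(churned ~ n_orders + total_spend + tenure_days,
                 data = features, family = binomial)
    summary(model)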

R is one language for Big Data, not the only one

David:

Agreed that R has wide functional and algorithmic breadth and is suited to many advanced analytics projects. But a Big Data IDE would need to support multiple development languages, including but not limited to R, in order to be considered truly comprehensive.

Big Data requires a comprehensive information management approach

Hello Jim - This fits in well with the briefing we had last week, and it is one reason that SAS is promoting the idea of a comprehensive approach to Information Management: an approach supported by products and services that cover the entire data-to-decision spectrum, including complete support for the analytics lifecycle. As you noted, the capabilities should be exhaustive, and the approach should handle any data requirement an organization has. And since the architecture has to be flexible - you shouldn't be locked into a specific database technology - providing infrastructure and tooling that supports a "build once, deploy anywhere" philosophy is critical. This allows you to scale from SMP to MPP to grid and to leverage the right mix of technologies, including Hadoop, based on the analytics requirements.

It's also critical that the analytics can be embedded in real-time operational systems, and that the solution provides the requisite business rules and workflow capabilities that wrap the analytics results. And finally, the overall solution should include an infrastructure that lets you integrate various technologies: managing the models, and embedding a combination of technologies in a data or analytics job workflow (e.g., a SAS job that seamlessly embeds MapReduce capability).

Mark Troester
IT/CIO Thought Leader & Strategist
SAS
Twitter @mtroester

A Big Data IDE would need to plug into an app/dev framework

Mark:

Per what you said about the need to embed analytics in real-time operational systems with business rules, workflow, etc., it's clear that a Big Data IDE would need to plug into a broader IDE framework, perhaps Eclipse, to integrate analytics with transactional applications. It occurred to me that an equally overbroad notion, that of a "Next Best Action IDE," would demand a Big Data IDE plus an SOA IDE, etc. To the extent that the industry provides a Big Data IDE that addresses just advanced analytics in Hadoop, it's a much narrower scope that's easier for vendors such as SAS to package and deliver into well-defined use cases (and to waiting customers). Best not to boil the cloud/ocean.

Jim

Boiling the ocean

I'm not sure we need an IDE that boils the ocean. That might get a tad top-heavy and cumbersome, especially when combining diverse things like optimization and CEP. But it's sure fun watching the vendors jump on the "we can do (almost) everything" bandwagon!

Yes, but an open-source Big Data IDE framework is quite feasible

Wayne:

Yes, I agree, which is why I ended the post with a call for an open-source Big Data IDE framework, perhaps Eclipse, to support pluggable development tools that data scientists and other Big Data developers can customize to their specific modeling and project requirements.

Jim

Better late than never

I just got to this interesting post. Thanks for sharing.

I certainly agree with the direction Jim and Mark describe, but I am fairly confident that most are not looking in the right direction when it comes to extending advanced analytics to the information workplace. For example, I think it's possible to do this without R, although I agree that it's the most popular option. If you dig deep, you can find published lab testing of different methods that are beginning to look really interesting, with the potential at least to impact the first generation of tools and applications currently in use.

If we are really discussing self-managed analytics across the enterprise, as captured in Forrester's January "Future of BI" report (which names Kyield among about ten companies interviewed, as I recall), then while we may not need to boil the ocean 24x7, we do need a carefully crafted data structure and data stream with adaptable data governance. I designed our CKO module to do just that in our patented systems approach to extending analytics to all knowledge workers in the organization. It's not at all intended to be a replacement for individual tools; rather, we employ a platform approach based on universal standards for easy plug-in (and no lock-in), and we obviously must include security parameters for access that range from extremely tight to very loose.

Even in our pilot program currently under development, it's a revolutionary approach compared to anything we've seen or heard of to date. The original R&D is almost 15 years old now, and the original patent application dates to early 2006; it took quite a while for the underlying technologies to catch up to the vision, especially scaling in near real time across large organizations. But the potential for improvement in how organizations perform is very interesting and exciting, especially relating to crisis prevention (a very old personal motivation), expedited discovery across disciplines, and enhanced innovation.

Thanks for the post and discussion.

Mark Montgomery
Founder & CEO
Kyield