“Big Data” Technology: Getting Hotter, But Still Too Hard For Most Developers

“Big Data” is coming up more often on the agendas of key vendors as well as some of the more advanced users of information management technology. Although some of this increased activity reflects PR calendars – companies promote new offerings in the spring – there’s more than that going on. The design patterns that fall under this large umbrella are genuinely spreading into a wider range of usage scenarios, driving continuing innovation from both technology providers and users. In part because of the frequent use of open source technology such as Apache Hadoop to implement “Big Data,” this is the type of innovation the industry most needs at this early stage of the market. A few key data points:

  • I attended last week’s IBM “Big Data” Symposium at the Watson Research Labs, together with several other Forrester analysts and a number of IBM customers. Among the analysts attending was Brian Hopkins, who blogged about it last week. We saw a number of interesting examples of “Big Data,” including those cited by a user panel featuring Illumina, eZly, Carolyn McGregor, Ph.D., of the University of Ontario Institute of Technology, and Acxiom. We also heard how IBM had applied Hadoop inside Watson, which recently won on Jeopardy!
  • Other vendors of data analytics, warehousing, and integration technology have recently briefed Forrester analysts on their “Big Data” related capabilities, both current and planned. Many vendors embed the Apache Hadoop codebase into their solutions, and many of those also include proprietary forks of the Apache project to address requirements, such as real-time data integration and high availability, that the open-source project has yet to support.
  • Noel Yuhanna and I are working on the next Forrester Wave™ evaluating “information-as-a-service,” or data services, technology. For this project, we are interviewing firms with data services apps in production. More than one of these firms is using or plans to use “Big Data” (via warehouse appliances or Hadoop) to manage increasing volumes of content coming from web interactions or physical devices like oil wells, developing insights they then deliver in real time to consumers through integrated data services interfaces. They view their data services layers as a point of not only integration but also security and governance, and most have implemented canonical models as a key part of their data services strategy. Note that I’ll be blogging about my day at the second annual Canonical Model Management forum (also last week) in the near future.

What does it all mean?
That is the subject of much research from Forrester this year, not only from Brian and Noel but also from Jim Kobielus, Gene Leganza, and others. Here’s my quick take based on what I know today:

  • Experts place much of the focus of “Big Data” on “data-centric” use cases, as it should be – advanced analytics performed by experts in data and statistics, extending the insights firms are gaining today beyond existing solutions like data warehouses, or pioneering newer use cases that conventional technology is less well suited to conquer.
  • However, “Big Data” also matters to application developers – at least, to those who are building applications in domains where “Big Data” is relevant. These include smart grid, marketing automation, clinical care, fraud detection and avoidance, criminal justice systems, cybersecurity, and intelligence.
  • One “big question” about “Big Data”: What’s the right development model? Virtually everyone who comments on this issue points out that today’s models, such as those used with Hadoop, are too complex for most developers. It takes a special class of developer to understand how to break a problem down into the components necessary for treatment by a distributed architecture like Hadoop (the word-count sketch just below shows how much ceremony even a trivial job requires). For this model to take off, we need simpler models that are more accessible to a wider range of developers – while retaining all the power of these special platforms.
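
To make that concrete, here is a minimal sketch of the canonical word-count job written against the standard Hadoop MapReduce Java API (the class and variable names are my own). Even this trivial problem must be recast as a map step, a reduce step, and job-wiring boilerplate:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every token in every input line.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce step: sum the 1s that arrive grouped by word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Job wiring: tell the cluster which classes to run and where the data lives.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The decomposition skill, not the Java syntax itself, is the real barrier for the average enterprise developer.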

Making complex things more accessible to developers by evolving the development model is right in the sweet spot for our team that serves application development and delivery professionals. We’ve already begun to address this issue, at least in a general way, by defining the emerging elastic application platform (EAP), great new work from John Rymer and Mike Gualtieri that shows how “NoSQL” techniques for data management will evolve as part of a broader platform for apps built on private, public, or hybrid cloud architectures.

What is the industry doing to make “Big Data” easier for developers?
Some of the existing approaches for making “Big Data” platforms accessible to more developers work by bringing familiar APIs like SQL to bear. While this may be appropriate for some applications, SQL brings baggage, too – primarily that it can “lock” the data schema for the application, depending on how developers use it. But the most flexible applications that work with unstructured content need to be able to evolve the data schema dynamically, based on the data showing up in the input content or streams.

For example, social web content that marketing mines for customer insights may evolve new kinds of information about new kinds of products or services, dynamically, at any time. Marketing pros can predict neither what these topics will be ahead of time nor what they will want to know about them – the structure evolves naturally from the content. Applications that work with unstructured data can benefit from this kind of dynamic schema evolution, and developers can work using Agile processes in such an environment, but they need a development model that is similarly dynamic to support their efforts.

From the data services/“Big Data” use cases we’ve seen so far, data services appear well suited to meeting this requirement. Developers can introspect (query) services at runtime to ask what information they have about which topics and then access that information for dashboards or other flexible and interactive means of visualization, or to inform other processes with analytical insight.
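
To illustrate what runtime introspection might look like from the developer's side, here is a hypothetical sketch in Java. The service host, the /metadata/topics and /topics/{name} endpoints, and the response shapes are all invented for illustration; each data services product exposes its own metadata API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DataServiceIntrospection {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // 1. Ask the data service what topics and attributes it knows about
    //    right now; the answer can change as new content streams in.
    HttpRequest discover = HttpRequest.newBuilder()
        .uri(URI.create("https://dataservices.example.com/metadata/topics"))
        .build();
    String topics = client.send(discover, HttpResponse.BodyHandlers.ofString()).body();
    System.out.println("Topics currently available: " + topics);

    // 2. Pull one topic whose schema may not have existed at build time,
    //    then hand it to a dashboard or downstream analytic process.
    HttpRequest fetch = HttpRequest.newBuilder()
        .uri(URI.create("https://dataservices.example.com/topics/customer-sentiment"))
        .build();
    String payload = client.send(fetch, HttpResponse.BodyHandlers.ofString()).body();
    System.out.println(payload);
  }
}

The key design point is that nothing about the topic list or its attributes is hard-coded into the application; the schema knowledge lives behind the service.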

What do you think? Are data services potentially relevant to your use of “Big Data”?

Comments

Hello, Nice information

Hello,

Nice information shared. Though it is a very detailed post, I am still unable to understand the 2nd point under the heading "What does it all mean?", so please could you explain it a bit more.

Thanks

What does it all mean?

Sure, let me expand on that point just a bit. "Big Data" usage is often ascribed to data-centric use cases like analytics. So, for example, a phone company might run an analytical job on the collection of all call-data-records (one per phone call made by each phone) and tie that back to locations, people, households, or other variables in order to shape their marketing strategy. Based on those analytics they might choose where to put another phone store, or which offers to make you when you call into the sales call center.
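
As a toy illustration of that data-centric pattern (the CallRecord type and its fields are invented for the example), the heart of such a job is a grouping and aggregation over call-data records; at real CDR volumes the same logic would be recast as a distributed job rather than an in-memory stream:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CdrRollup {
  // One record per phone call, already tied back to a household.
  record CallRecord(String householdId, String cellTowerId, int durationSeconds) {}

  public static void main(String[] args) {
    List<CallRecord> cdrs = List.of(
        new CallRecord("h-001", "tower-7", 120),
        new CallRecord("h-001", "tower-7", 300),
        new CallRecord("h-002", "tower-9", 45));

    // Total talk time per household: the kind of rollup that feeds
    // decisions about store placement or targeted offers.
    Map<String, Integer> talkTimeByHousehold = cdrs.stream()
        .collect(Collectors.groupingBy(CallRecord::householdId,
            Collectors.summingInt(CallRecord::durationSeconds)));
    System.out.println(talkTimeByHousehold); // e.g. {h-001=420, h-002=45}
  }
}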

So what's the application developer relevance? Consider the smart grid example, the first one I cited where I've been seeing applications being developed that also use "Big Data." A utility company with a smart grid has a lot of data flowing into its systems from all over the grid, giving information about the real-time status of the grid as well as all the endpoints - the businesses and homes that are consuming energy. The information includes a lot of detail about how much energy gets consumed at each endpoint, at what time and at what cost, in relation to the rest of the neighborhood, for example.

So one common type of smart grid app is one that enables a group of neighbors to compete with one another to see who can be the most efficient (and perhaps win prizes like special discounts). Normally you can't see your neighbors' data, so this requires a social-community angle: signing up to allow neighbors to see one another's data, in aggregate, compared to their own, as well as highlights of who's lowest, who's made the most improvement, or whatever.

Apps like this already exist. And utility companies have found that people who might otherwise be competing to see who has the greenest lawn, or the prettiest flowers, are also motivated to compete to see who can save the most energy (in some neighborhoods, anyway). These kinds of app-based conservation programs are already showing far better results at lowering utility bills and conserving energy than any amount of preaching that "greens" might be doing.
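
To sketch what the neighborhood-competition logic might look like (the MeterReading type and its fields are invented for illustration; a real app would pull aggregated readings from the utility's "Big Data" store), the ranking itself is simple once the data is accessible:

import java.util.Comparator;
import java.util.List;

public class EnergyLeaderboard {
  // One opted-in household's usage for the prior and current month.
  record MeterReading(String household, double lastMonthKwh, double thisMonthKwh) {
    double improvementPct() {
      return 100.0 * (lastMonthKwh - thisMonthKwh) / lastMonthKwh;
    }
  }

  public static void main(String[] args) {
    List<MeterReading> optedIn = List.of(
        new MeterReading("Smith", 900.0, 750.0),
        new MeterReading("Jones", 1100.0, 1050.0),
        new MeterReading("Lee", 800.0, 640.0));

    // Most-improved household first: the "highlights" view neighbors see.
    optedIn.stream()
        .sorted(Comparator.comparingDouble(MeterReading::improvementPct).reversed())
        .forEach(r -> System.out.printf("%s: %.1f%% saved%n",
            r.household(), r.improvementPct()));
  }
}

The hard part, of course, is not this ranking; it's getting trustworthy, permissioned aggregates out of the grid data in the first place.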

Building such an app requires a developer to tap into the "Big Data" from the grid, but also to build other application functionality on top of it. And most developers (other than those who have been doing analytics, of course) are accustomed to using plain SQL to access a database; unless they are fairly sophisticated Java developers, they probably feel a bit mystified by Hadoop.

Note that there are some solutions already available. For example, Apache Pig is a high-level language for programming with Hadoop, so you don't have to write Java. It contains statements like:

-- Load DBpedia instance-type triples. The pignlproc loader and the
-- wikipedia_links2 relation come from earlier steps of the full example
-- script (linked below).
instance_types = LOAD '$INPUT/instance_types_en.nt'
    USING pignlproc.storage.UriUriNTriplesLoader(
        'http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
    AS (dburi: chararray, type: chararray);

-- Drop the catch-all owl:Thing type, then join typed resources to their
-- Wikipedia links and keep (wikiuri, type) pairs.
instance_types_no_thing = FILTER instance_types
    BY type != 'http://www.w3.org/2002/07/owl#Thing';
joined = JOIN instance_types_no_thing BY dburi, wikipedia_links2 BY dburi;
projected = FOREACH joined GENERATE wikiuri, type;

Not trivial, but much easier than the perhaps 10 times as many lines of Java that would be required to do the same thing.

But most developers don't know Pig, either. So although Pig (and other tools, like Hive) are steps in the right direction, they don't entirely solve the problem.

(The example of Pig code above came from: http://tinyurl.com/6bwmow7)

Great post, Mike, here are my thoughts

Excellent discussion, Mike. I'll add my two cents.

What's the current Big Data programmability model? It's equal parts model-driven, declarative, and procedural; SOA, REST, and MapReduce; Java, C#, and LAMP; visual and code-oriented; with every traditional and emerging development paradigm thrown in. It's a mash of the familiar and the bleeding-edge.

Keep in mind that Big Data is fundamentally Big Data Virtualization. Treat all data, metadata, and other artifacts as services. Wrap everything behind abstractions. Use RDF and other specs to address semantic interoperability through higher-level vocabularies.

But be aware that Big Data, currently, is largely data-centric. The CRUD in the mud of Big Data is the same as for small data: SQL. All the service-centric frontiers of Big Data go beyond that to incorporate metadata, models, higher-order abstractions, and the like, but, at heart, trusty ol' SQL (or vendor-proprietary tweaks such as SQL-MapReduce from Teradata/Aster Data) is still the onramp to much of this environment (because, after all, much Big Data work is done in petabyte-scale enterprise data warehouses, which are, mostly, RDBMSs where SQL is the lingua franca).

What's the right development model for Big Data? It's a judicious blend of the familiar/proven and the advanced/powerful, as I've sketched out in this reply.

Not that you would but you could

Most of the recent Big Data hype reminds me of this Range Rover commercial: http://bit.ly/lealjo where the message is “So, while you may never storm Pikes Peak or own the passing lane in Munich, isn’t it nice to know you could?”

Yes, we could do many things with big data, but until we see a clearer definition of actual use cases, the industry won't be able to make “Big Data” easier for developers. SQL is focused on relational/business-data use cases; maybe a focus on social media will help the industry define social media big-data abstractions that average developers can consume. And the utility grid will come up with different abstractions.

My approach here at GoodData is to start with the user and build big data solutions for their needs, and I see similar approaches at other social media analytics, game analytics, and similar companies. Open-ended discussion about BIG DATA is like owning the passing lane in Munich. Not that you would, but you could...

Range Rover

Thanks, Roman. Hey, would you mind telling my neighbors here in MD that they don't really need those pesky Range Rovers that keep clogging up the roads around here? They all seem to be prepared to storm Everest, or the Amazon, when they're really just on their way to Pilates.

Seriously, though, I think you have a good point. In looking at the question of "what the development model should be," we're certainly thinking about that while looking at certain application types / patterns. Perhaps the most common I've seen so far is when the business wants an app to change what it does based on the info flowing up from a Big Data resource.

One classic example is an app used by telco call-center agents that makes various suggestions on how to resolve issues, or on which offers to make, based on analytics updated from a real-time view of call-data history. The reason it's important to have that data up-to-the-moment is that it's often the case that when someone calls in, they were motivated to call by something that happened just a few minutes before.

Oil companies using upstream well data and utility companies using smart grid data have similar apps (not call centers, of course, but apps informed by the data). I haven't looked in enough detail at traffic management to know for sure, but it appears likely those apps work that way too, although much of the behavior in those systems is not human-facing; it's optimization of automated equipment like traffic lights or speed limit signs.

Use cases and integration patterns for development of Big Data apps

Roman:

For sure, the predominant Big Data use cases, app types, and integration patterns must shape the development paradigm.

One such use case that I'm focusing on is Next Best Action in Multichannel CRM. Increasingly, we're seeing massively parallel EDWs and even Hadoop clouds being deployed into CRM environments that capture, aggregate, filter, and analyze huge streaming social, geospatial, clickstream, and transactional data feeds (pushing into the petabytes) to drive targeted offers and customer experience optimization--by means of embedded automated "recommendation engines." The app types for Next Best Action are a combination of transactional, analytical, and orchestration. The integration patterns are various blends of centralized, hub-and-spoke, and federated. The "decision logic" modeling/development environments are a combination of predictive modeling, business rules management, and BPM.

In my research, I'm delving into the extent to which the industry might be integrating the modeling tooling to be able to specify all this Next Best Action decision logic in a more unified, visual tooling. I'm not seeing a lot yet, in terms of vendor support, but I've only recently begun to explore this angle.

Clearly, this goes to the heart of Mike's question of the optimal development paradigm for Big Data.

Big Data -> Unbounded streaming data

"Big Data" isn't new - we've had to deal with large data for a long time. What's different now is providing tools for developers, and organizations, teams and users in general to effectively capture, visualize, and analyze the streaming data that is flowing through mobile, social media and other internet feeds. Users can craft analyses that then application developers can customize and 'scale out'. By putting similar tools in front of both developers and the end users the iterations between concept, rapid prototype, development, and use are much tighter and more effective.

At GeoIQ we've favored the geographic lens - humans have become incredibly adept at comprehending huge amounts of information in a spatial context, since they can relate both local and global phenomena to recognizable boundaries (there's my house, the highway, the coast of the US, etc.). By providing easy-to-use tools to visualize data on a map, ask questions of those data, and quickly tweak results to iterate and evolve the analysis, users are able to make decisions and see the impact of those decisions.

Mobile application developers in particular need good tools to analyze the data from their apps and users in order to prove the footstream impact as it relates to their users and business model. Location and geospatial tools for analysis are fortunately getting much easier to use and integrate. I'm excited when I can start seeing the impact of these large analyses inside my applications in order to provide a better experience and understanding of the local context around me.