Social Network Analysis: The Fuse Igniting Enterprise Data Warehouse Growth. It’s Planet Petabyte or Bust!
Posted by James Kobielus on January 7, 2010
Social networks have always been with us, of course, but now they’ve gained concrete reality in the online fabric of modern life.
Social network analysis has, in a real sense, been with us almost as long as we’ve been doing predictive analytics. Customer churn analysis is the killer app for predictive analytics, and it is inherently social. It’s long been known that individual customers don’t always make their churn decisions (i.e., whether to renew or bolt to the competition) in isolation. As they run the continual calculus called loyalty in their heads and hearts, they’re receiving fresh feeds of opinion from their friends and families, following the leads of peers and influencers, and keeping their fingers to the cultural breeze. You could also make a strong case for social networking (i.e., individual behaviors spurred, shaped, and encouraged within communities) as a key independent variable driving cross-sell, up-sell, fraud, and other phenomena for which we’ve long built predictive models.
The other day, a Forrester client was asking me for educated guesses on how fast the average enterprise data warehouse (EDW) is likely to grow over the next several years, and as I was working through the analysis, I couldn’t avoid the conclusion that social network analysis—for predictive and other uses—will be an important growth driver (though not the entire story). I’d like to lay out my key points.
First off, I need to reiterate, per my blog post from last month, that social network analysis is much more than parsing a stream of tweets to see who’s flaming whom these days. At heart, it involves exploring the shifting web of relationships among people based on their profiles, interactions, and affinities.
Second, this definition clearly encompasses call-detail-record (CDR) analysis, which is a core telecommunications industry application of predictive analytics and data mining. We all know CDR analysis as the means by which carriers track our calls for billing, collection, usage monitoring, fraud detection, and other core operational requirements. Of course, CDRs also constitute a core data set that carriers leverage for sales, marketing, customer service, churn analysis, “friends and family” programs, and other key functions.
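To make the link between CDRs and social networks concrete, here is a minimal sketch that treats each call record as an edge between two subscribers and counts how many distinct contacts each subscriber has. The record layout (caller, callee, seconds), the function names, and the sample rows are all invented for illustration; real CDR schemas vary by carrier.

```python
from collections import defaultdict

# Hypothetical CDRs: (caller, callee, duration in seconds). Invented sample data.
cdrs = [
    ("alice", "bob", 120),
    ("alice", "carol", 45),
    ("bob", "carol", 300),
    ("dave", "alice", 60),
    ("alice", "bob", 30),
]


def build_call_graph(records):
    """Return an undirected graph: subscriber -> set of distinct contacts."""
    graph = defaultdict(set)
    for caller, callee, _seconds in records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def rank_by_contacts(graph):
    """Sort subscribers by number of distinct contacts, most connected first."""
    return sorted(graph.items(), key=lambda item: (-len(item[1]), item[0]))


graph = build_call_graph(cdrs)
for subscriber, contacts in rank_by_contacts(graph):
    print(subscriber, len(contacts), sorted(contacts))
```

A churn model could take a count like this as one input, for example by flagging subscribers whose contacts have recently left; that step is not shown here.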
Third, CDRs are just one of many types of interaction, transaction, and behavioral records being leveraged by today’s online service providers, of which traditional telcos are just one category. Increasingly, customer-generated GPS and other geolocation data is becoming just as key for operational and predictive uses, especially for wireless carriers. Likewise, clickstream analysis is the lifeblood of personalization and customer experience optimization in Web 2.0 social networks, enterprise portals, clouds, and other online environments.
Fourth, CDRs, geolocation data, clickstreams, tweetstreams, audit log records, and other “event” data are beginning to flood into enterprise data warehouses (EDWs), where they are being aggregated for historical and predictive analysis; in other words, for social network analysis in the broader context discussed above. In fact, event data represents one of the most important new categories of information causing the EDW to balloon into the hundreds of terabytes and even petabytes. Another important new information category in the EDW is unstructured text. Some new information types, such as tweetstreams, straddle both categories: they are event data that is unstructured.
Fifth, today’s vanguard of petabyte-scale EDWs (the “outliers”) tends to cluster in particular verticals, most notably telecommunications and Web 2.0 pure-plays. In these verticals, which one should regard as the core of the new “cloud” paradigm, they’re used primarily for CDR analysis, customer churn analysis, next best offer, online experience optimization, fraud detection, and other applications that rely on social network analysis.
Sixth, the growth of cloud computing in this decade, across all verticals, will create a huge demand for petabyte-scale EDWs to drive the social network analysis that is central to this way of doing business. The very large EDWs that today are vertical-specific outliers will, by the end of this decade, move into the horizontal, cross-industry mainstream. Where distributed analytical databases are concerned, we’re all skyrocketing toward Planet Petabyte.
Now, to close the loop on EDW sizing, here is the rough order-of-magnitude estimate I like to use on such questions. Generally, Forrester breaks out key EDW sizing metrics into the following areas: storage, loading, and usage concurrency. As a rough estimate, approximately 90 percent of deployed data warehouses have storage capacity (raw, uncompressed data) under 10 terabytes (TB), loading capacity under 1 TB/hour, and usage concurrency under 100 users.
Generally, we foresee average EDW capacities across all industries doubling every 2-3 years throughout this decade, with the primary gating factors being the cost of storage and the efficiency of compression. In other words, it won’t be as fast as Moore’s Law (i.e., doubling every 18 months), but more like every 24-36 months. In the early years of this decade, the annual EDW-capacity growth rate will probably be less than 25 percent, but, with advances in storage, compression, and cloud technology/adoption, the annual growth rate will probably accelerate throughout the decade, reaching 200 percent by 2019. This is consistent with the “doubling every 24-36 months” average growth rate that I sketched out for the decade as a whole.
With those educated guesses and assumptions in mind, it’s plausible to forecast that, by the end of this decade, the average DW will be between 10 and 40 times larger than it is now. That is, by 2020, 90 percent of EDWs will have a storage capacity in the 100s of TBs, with petabyte-scale EDWs common, 10 TB/hour loading the norm, and usage concurrency often in the 1,000s of concurrent queries/accesses.
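The arithmetic behind that range can be checked with a short sketch. It assumes smooth compound growth at a constant doubling period, a ten-year horizon, and a 10 TB starting point (the 90th-percentile storage ceiling quoted earlier). The function names and constants are illustrative, not part of any Forrester model.

```python
def growth_multiple(years: float, doubling_months: float) -> float:
    """Size multiple after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def annual_growth_rate(doubling_months: float) -> float:
    """Constant yearly growth rate (as a fraction) implied by a doubling period."""
    return 2 ** (12 / doubling_months) - 1


BASELINE_TB = 10     # assumption: today's 90th-percentile storage ceiling
HORIZON_YEARS = 10   # assumption: 2010 through 2020

# Slow case, fast case, and the 18-month Moore's Law pace for comparison.
for months in (36, 24, 18):
    multiple = growth_multiple(HORIZON_YEARS, months)
    rate = annual_growth_rate(months)
    print(
        f"doubling every {months} months: {rate:.0%} per year, "
        f"{multiple:.0f}x in {HORIZON_YEARS} years, "
        f"{BASELINE_TB * multiple:,.0f} TB from a {BASELINE_TB} TB start"
    )
```

Under these assumptions, doubling every 36 months gives roughly a 10x increase over the decade and doubling every 24 months gives 32x, which puts a 10 TB warehouse in the low hundreds of terabytes by 2020.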
Clearly, cloud-based storage is key to realizing this forecast. I’m working on a forthcoming Forrester report addressing the virtualization of EDWs into the cloud, and storage virtualization is a core technology. That report will be published in the next quarter or so. I’d love to hear your thoughts on all this.