Data scientists don’t work in isolation. As with any scientists, they rely on a wide range of people in adjacent roles to help them do their jobs as effectively as possible.
Think about science generally. Over the historical development of modern science, the specialization of roles has proliferated steadily. But today’s professional science establishment is a relatively recent phenomenon. Back in the Middle Ages — and even well into the modern era — scientists often had to be jacks of all trades in order to carry on their investigations. Until the 19th century, there were few professional scientists, research universities, or commercial labs. There were no eager, underpaid graduate students to press into service. Until the 20th century, most scientists had to build and maintain their own laboratories, invent and calibrate their own instruments, painstakingly record their own observations, and concoct and promote their own theories.
Today’s professional scientists — of which data scientists are a key category — have it much easier. Whether they work with particle accelerators or linear regression models, scientists know they don’t need to be their own chief cooks and bottle washers. They can make science their day job and rely on a host of others for all of the necessary supporting tools and infrastructure. We find the following broad division of labor in all of today’s scientific disciplines, including data science:
Data scientists are a curious breed. The term encompasses a wide range of specialties, all of which rely on statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data.
Who belongs in this category? Clearly, the “quants” are fundamental. Anybody who builds multivariate statistical models, regardless of the tool they use, might call themselves a data scientist. Likewise, data mining specialists who look for hidden patterns in historical data sets — structured, unstructured, or some blend of diverse data types — may certainly use the term. Furthermore, a predictive modeler or any analyst who builds fact-based what-if simulations is a data scientist par excellence. We should also include anybody who specializes in constraint-based optimization, natural language processing, behavioral analytics, operations research, semantic analysis, sentiment analysis, and social network analysis.
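To make “fact-based what-if simulation” concrete, here is a minimal Monte Carlo sketch in Python. The scenario, numbers, and function name are hypothetical illustrations, not any particular vendor’s tooling:

```python
import random

def simulate_revenue(price, demand_mean, demand_sd, n_trials=10_000, seed=42):
    """Monte Carlo what-if: estimate expected revenue under uncertain demand.

    All parameters here are hypothetical; a real model would be fit
    to historical data rather than assumed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Demand is modeled as normally distributed, floored at zero.
        demand = max(0.0, rng.gauss(demand_mean, demand_sd))
        total += price * demand
    return total / n_trials

# What if we raise the price but demand softens and grows more volatile?
baseline = simulate_revenue(price=10.0, demand_mean=1000, demand_sd=200)
scenario = simulate_revenue(price=12.0, demand_mean=900, demand_sd=250)
```

Comparing `baseline` and `scenario` is the essence of the analyst’s what-if question: each run turns an assumption about the business into a number that can be argued over.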
But these jobs are only one half of the data-science equation. The “suits” are also fundamental. Any business domain specialist who works with the tools and approaches listed above may consider him- or herself a data scientist. In fact, if one and the same person is a black belt in SAS, SPSS, R, or other statistical tools, and also an expert in marketing, customer service, finance, supply chain, or another business specialty, that person is a data scientist par excellence.
Both of these skill sets are fundamental to high-quality data science. Lacking statistical expertise, you can’t determine which algorithms and approaches are the most appropriate foundation for your statistical models. Lacking business domain expertise, you can’t identify the most valid variables and the most appropriate data sets to build your models around.
The big data universe revolves around this seemingly new role called “data scientist.” For IT professionals who are just now beginning to explore big data, the notion of a data scientist may seem a bit trendy, hence suspect. How does it differ from such familiar jobs as statistical analyst, data miner, predictive modeler, and content analytics specialist?
Yes, data scientist is a trendy new job title to emboss on your business card. But it’s also a very useful new term for referring to a wide range of advanced analytics functions that heretofore have had no consensus category label. The term recognizes that advanced analytics developers, like scientists generally, spend their careers exploring new data for powerful insights that may not be obvious at first glance.
Indeed, one might define a data scientist as someone who uses statistical algorithms and interactive exploration tools to uncover nonobvious patterns in observational data. This definition is broad enough to encompass a wide range of data scientists doing various types of analyses against many data types. The tools may be usable by any intelligent person, or they may be so specialized and abstruse that you practically need a Ph.D. in higher mathematics to get started. The underlying algorithms may be limited to the most common multivariate regression approaches or may include the latest advances in artificial intelligence and machine learning. The exploration may be highly visual, or it may also involve trial-and-error iteration through complex statistical models.
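For instance, the “most common multivariate regression approaches” mentioned above fit in a few dozen lines. This is a hedged sketch, not a production tool: the data, the true coefficients, and the `ols` helper are all invented for illustration, and the fit uses ordinary least squares via the normal equations.

```python
import random

# Toy observational data (illustrative, not from any real data set):
# sales driven by ad spend and a seasonality index, plus noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    ad = rng.uniform(0, 100)
    season = rng.uniform(-1, 1)
    sales = 50 + 2.0 * ad + 15.0 * season + rng.gauss(0, 5)
    rows.append((ad, season, sales))

def ols(rows):
    """Fit y = b0 + b1*x1 + b2*x2 by solving the normal equations."""
    # Accumulate X^T X (3x3) and X^T y (3), with an intercept column of ones.
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for x1, x2, y in rows:
        x = (1.0, x1, x2)
        for i in range(3):
            xty[i] += x[i] * y
            for j in range(3):
                xtx[i][j] += x[i] * x[j]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    a = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [a[r][k] - f * a[col][k] for k in range(4)]
    return [a[i][3] / a[i][i] for i in range(3)]

b0, b1, b2 = ols(rows)  # should recover roughly 50, 2.0, and 15.0
```

The recovered coefficients approximate the “nonobvious pattern” planted in the data; with real observational data, of course, nobody hands you the true coefficients to check against.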
Customer experience is becoming the new currency of business success. If you make quality of experience the centerpiece of your customer relationship management (CRM) strategy, you will be creating a sustainable business asset of substantial value.
Customer experience has qualitative and quantitative returns, as I will discuss next month at Forrester’s Business Process (BP) Forum. For a detailed discussion of customer experience optimization, also take a look at this recent Forrester report that I authored. You can measure the qualitative business return on customer experience in, dare I say it, love. Hopefully, your customers love the multichannel experience you provide, and, as a consequence, seek to deepen and extend the relationship. The concomitant of that is the quantitative return, summed up by a single word: money. If you’re making customers happy, hopefully that translates into sales, profits, renewals, referrals, and other bottom-line boosts.
That’s all well and good, but how can you directly translate love — i.e., quality of experience — into money, measure the impact, and calculate the return on your investment in experience-boosting technologies?
CRM next best action platforms are the key to realizing this promise. CRM next best action environments shape experience through embedded analytics that guide all interactions and offers across all customer-facing channels, processes, and roles. In addition to predictive analytics and business rules management systems, enterprises often incorporate into their next best action initiatives such experience-boosting investments in decision automation, sentiment analysis, conversation management, dynamic case management, knowledge management, and social networking.
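As a rough illustration of how business rules might arbitrate over a predictive score in a next-best-action flow, consider this hypothetical sketch. The `Customer` fields, the 0.7 threshold, and the action names are all invented for illustration, not any vendor’s API:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    churn_risk: float        # predictive-model score in [0, 1] (hypothetical)
    open_complaint: bool     # signal from case management (hypothetical)
    eligible_offers: list    # offers this customer qualifies for

def next_best_action(c: Customer) -> str:
    """Pick one action per interaction: rules arbitrate over the model score."""
    if c.open_complaint:
        # Rule: resolve service issues before trying to sell anything.
        return "route_to_agent"
    if c.churn_risk > 0.7 and "retention_discount" in c.eligible_offers:
        # Predictive score triggers a retention play for at-risk customers.
        return "retention_discount"
    if c.eligible_offers:
        # Default: lead with the top eligible offer.
        return c.eligible_offers[0]
    return "no_action"
```

The design point is that the predictive analytics and the business rules are separate layers: the model scores, the rules decide, and every customer-facing channel calls the same decision function so the experience stays consistent.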
We’ve all been through this many times before. So when will it be Hadoop’s turn?
The same thing happens every time. Some shiny new thing gets built up until it’s too big for its britches and then we delight in shooting it down. Or taking it down a few notches until, chastened, it accepts its less-than-lofty position in the divine order of all things IT.
Hadoop is no fad, but it is definitely getting set up for a sober reappraisal — possibly by this time next year, or as soon as a significant number of major EDW vendors roll out their Hadoop products and strategies. I’ve already painted in broad brushstrokes the milestones that Hadoop needs to pass to be considered truly ready for enterprise prime time. I’m reasonably confident that it will meet those challenges over the next two to three years. I’m even willing to meet the open-source absolutists halfway on their faith that the Apache community will be guided by some invisible hand toward a single market-making distro with universal interoperability, peace, love, and understanding.
But even if Hadoop stays on track toward maturation, we’re likely to see the inevitable backlash emerge, spurred by the widespread impatience that usually follows overweening hype. The snarkfest will come as analytics pros start to realize that, promising as this new approach may be, there are plenty of non-Hadoop EDWs that can address the core petabyte-scale use cases I laid out. Many IT practitioners will ask why they should pay good money for a new way of doing things, with all the concomitant disruptions and glitches, when they can simply repurpose their investments in platforms like Teradata, Oracle, IBM, and Microsoft.
What’s clear is that Hadoop has already established its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and embedded execution of advanced analytics. As noted in a recent blog post, this is in fact the dominant use case for which Hadoop has been deployed in production environments.
Yes, traditional (Hadoop-less) EDWs can in fact address this specific use case reasonably well — from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it’s just a matter of time — one to two years, tops — before all EDW vendors bring Hadoop into the heart of their architectures. For those EDW vendors who haven’t yet committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.
Where the next-generation EDW is concerned, the petabyte staging cloud is merely Hadoop’s initial footprint. Enterprises are moving rapidly toward the EDW as the hub for all advanced analytics. Forrester strongly expects vendors to incorporate the core Hadoop technologies — especially MapReduce, Hadoop Distributed File System, Hive, and Pig — into their core architectures. Again, the impressive growth in MapReduce as a lingua franca for predictive modeling, data mining, and content analytics will practically compel EDW vendors to optimize their platforms for MapReduce, alongside high-performance support for SAS, SPSS, R, and other statistical modeling languages and formats. We see clear signs that this is already happening, as with EMC Greenplum’s recent announcement of a Hadoop product family and indications from some of that company’s competitors that they have similar near-term road maps.
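To illustrate why MapReduce travels so well as a modeling abstraction, here is a toy, single-process sketch of the map-shuffle-reduce pattern. Real Hadoop distributes each phase across a cluster; the function names below are illustrative only:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit one (key, value) pair per token.
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (the framework does this in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

docs = ["Hadoop scales out", "Hadoop runs MapReduce"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
# counts["hadoop"] == 2
```

Because the map and reduce steps are pure functions over key-value pairs, the same analytic logic parallelizes across thousands of nodes without change, which is exactly what makes the abstraction attractive for predictive modeling, data mining, and content analytics at scale.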
Problems don’t care how you solve them. The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal.
When people speak of “Big Data,” they’re referring to problems that can best be addressed by amassing massive data sets and using advanced analytics to produce “Eureka!” moments. The issue of what approach — Hadoop cloud, enterprise data warehouse (EDW), or otherwise — gets us to those moments is secondary.
It’s no accident that Big Data mania has also stimulated a vogue in “data scientists.” Many of the core applications of Hadoop are scientific problems in linguistics, medicine, astronomy, genetics, psychology, physics, chemistry, mathematics, and artificial intelligence. In fact, Yahoo’s scientists not only had a predominant role in developing Hadoop but — as exploratory problem-solvers — they are active participants in Yahoo’s efforts to evolve Hadoop into an even more powerful scientific cloud platform.
The problems that are best suited to Hadoop and other Big Data platforms are scientific in nature. What they have in common is a need for analytical platforms and tools that can rapidly scale out to the petabyte level and support the following core features:
Enterprises have options. One of the questions I asked the firms I interviewed as Hadoop case studies for my upcoming Forrester report is whether they considered using the tried and true approach of a petabyte-scale enterprise data warehouse (EDW). It’s not a stretch, unless you are a Hadoop bigot and have willfully ignored the commercial platforms that already offer shared-nothing massively parallel processing for in-database advanced analytics and high-performance data management. If you need to brush up, check out my recent Forrester Wave™ for EDW platforms.
Many of the case study companies did in fact consider an EDW like those from Teradata and Oracle. But they chose to build out their Big Data initiatives on Hadoop for many good reasons. Most of those are the same reasons any user adopts any open-source platform: By using Apache Hadoop, they could avoid paying expensive software licenses; give themselves the flexibility to modify source code to meet their evolving needs; and avail themselves of leading-edge innovations coming from the worldwide Hadoop community.
But the basic fact is that Hadoop is not a radically new approach to extremely scalable data analytics. You can use a high-end EDW to do most of what you can do with Hadoop, with all the core features — including petabyte scale-out, in-database analytics, mixed-workload support, cloud-based deployment, and complex data sources — that characterize most real-world Hadoop deployments. And the open-source Apache Hadoop code base, by its devotees’ own admission, still lacks such critical features as the real-time integration and robust high availability you find in EDWs everywhere.
Most Hadoop-related inquiries from Forrester clients come to me. These have moved well beyond the “What exactly is Hadoop?” phase to the stage where the dominant query is “Which vendors offer robust Hadoop solutions?”
What I tell Forrester clients is that, yes, Hadoop is real, but that it’s still quite immature. On the “real” side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the “immature” side, Hadoop is not ready for broader deployment in enterprise data analytics environments until the following things happen:
More enterprise data warehousing (EDW) vendors adopt Hadoop. Of the vendors in my recent Forrester Wave™ for EDW platforms, only IBM and EMC Greenplum have incorporated Hadoop into the core of their solution portfolios. Other leading EDW vendors interface with Hadoop only partially and only at arm’s length. We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still three to five years from fruition. It’s likely that most EDW vendors will embrace Hadoop more fully in the coming year, with strategic acquisitions the likely route.
Early implementers converge on a core Hadoop stack. The companies I’ve interviewed as case studies indicate that the only common element in Hadoop deployments is the use of MapReduce as the modeling abstraction layer. We can’t say Hadoop is ready-to-serve soup until we all agree to swirl some common ingredients into the bubbling broth of every deployment. And the industry should clarify the reference framework within which new Hadoop specs are developed.
It was just a matter of time. Aster Data, one of the most innovative startups in the enterprise data warehousing (EDW) arena, is moving rapidly into the ranks of leading vendors in this hotly competitive space. Just this morning, Teradata, one of the longtime EDW powerhouses, announced that it is acquiring San Carlos, California-based Aster Data. This $263 million all-cash deal, expected to close in the second quarter, will bring Aster Data’s well-regarded brand, exceptional team, growing product portfolio, and sophisticated intellectual property (IP) fully into Teradata.
For starters, the acquisition further substantiates several market trends that we called out in the recent Forrester Wave™ on EDW platforms:
Vendor consolidation proceeds apace. The EDW market has largely consolidated, though startup activity remains strong. Customer demand for one-stop shopping has driven consolidation and demand for completely integrated appliance-based EDWs — and, increasingly, for cloud- and software-as-a-service (SaaS)-based access to the same functionality. The past year has seen SAP acquire Sybase, IBM purchase Netezza, EMC buy Greenplum, HP announce its intention to absorb Vertica — and now this latest bombshell deal. Clearly, Teradata — the long-ago first mover in EDW appliances — is acquiring Aster Data in part for Aster Data’s strong appliance-based nCluster platform, which is architected for modular scaling of MapReduce operations.