Learn From Vegas Casinos How To Get Smarter About Data Analytics

Ever wonder how Las Vegas casinos catch card-counting teams at Blackjack tables, like the MIT team immortalized in the film “21” with Kevin Spacey? They use many techniques, some of which are confidential, but one we know about is their use of Entity Analytics on many intersecting streams of information about their patrons or potential employees. I recently had the chance to learn more about Entity Analytics and Big Data from one of the top industry thought leaders, Jeff Jonas of IBM.

  

[Images: Jeff Jonas; Kevin Spacey in 21]

This opportunity came when Marcel Jemio, Chair of the Fiscal Service Data Stewards at the US Treasury Dept. (and a Forrester client), invited me to a presentation Jeff gave at a special internal event at the Fiscal Service in Washington, D.C. So of course I leapt at the opportunity! Marcel opened the session with an overview of why Treasury is interested in data and analytics: Treasury is charged with helping the nation guard against the kind of national or global financial collapse that triggered the 2007-2009 recession. Therefore it’s crucial that the stewards of the nation’s financial data, like Marcel and his colleagues, continuously improve the insights we gain from this data.

This data is more connected and interoperable all the time, across multiple public and private sector organizations with common goals. Making key insights from this data available more openly, but securely, increases transparency and visibility of potential issues to key decision makers in government and commercial enterprises. But gaining those insights means linking all this related data, which in turn requires the Fiscal Service to adopt global industry data standards. If you can’t link and reuse data, it’s much less valuable!

A Lifetime Spent Linking Data Together

Jeff invented a really cool way to link information called NORA (Non-Obvious Relationship Awareness), and he’s implemented it multiple times on different platforms in different eras since the first version in the mid-1980s, through the course of his career as an innovator, scientist, and entrepreneur. Jeff also founded Systems Research & Development (SRD), which IBM acquired in 2005; that is how Jeff ended up as an IBM Fellow and Chief Scientist of the IBM Entity Analytics Group at Watson labs.

Vegas casinos used a version of NORA to bring down that MIT card-counting team led by Bill Kaplan, through surveillance of live Blackjack play. Casinos routinely aim to connect eighteen or more different lists of people who are known to have defrauded or otherwise attempted to take advantage of the casinos, and they do this through the kind of entity matching that NORA enables. For example, a casino may connect information about a person who was arrested for a crime under an alias with a person applying for a job as a croupier under her real name, because both records share the same phone number. Jeff also helped intelligence agencies deal with the “connect the dots” problems they faced after 9/11.
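To make the phone-number example concrete, here is a minimal Python sketch of attribute-based matching across lists. It is not NORA itself, whose internals are not described here; the record fields, source names, and the helper functions `normalize_phone` and `link_by_phone` are all hypothetical, chosen only to illustrate the idea of linking records that share one attribute.

```python
import re
from collections import defaultdict


def normalize_phone(raw):
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)


def link_by_phone(records):
    """Return phone numbers that appear in more than one source list,
    together with the records that share them."""
    by_phone = defaultdict(list)
    for rec in records:
        by_phone[normalize_phone(rec["phone"])].append(rec)
    return {
        phone: recs
        for phone, recs in by_phone.items()
        if len({r["source"] for r in recs}) > 1
    }


# Made-up records standing in for two of the casino's many lists.
records = [
    {"source": "arrest_records", "name": "Jane Roe (alias)", "phone": "(702) 555-0134"},
    {"source": "job_applications", "name": "Janet Doe", "phone": "702.555.0134"},
    {"source": "job_applications", "name": "Sam Poe", "phone": "702-555-0199"},
]

links = link_by_phone(records)
for phone, recs in links.items():
    print(phone, [(r["source"], r["name"]) for r in recs])
```

A production system would presumably weigh many attributes (names, addresses, dates of birth) rather than rely on a single exact match, but the sketch shows why one shared value can be enough to surface a non-obvious relationship.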

Making these connections requires an integrated view across observations from many different sources, including structured, semi-structured, and unstructured data, and even advanced sources such as video with facial recognition. Jeff pointed out that one flaw in conventional business intelligence tools is that they require smart people to ask smart questions, and only then can these tools give answers. There’s no way your organization has enough smart people to ask all the right questions all the time, so you need analytics that find relevant connections and bring them to your attention, telling you things you would otherwise never have known, such as the connection between the arrest record and the croupier’s job application. Entity Analytics are also quite valuable for developing richer "views of the customer" as well as for householding and other techniques crucial to success in the era of Digital Disruption.

Jeff used a story about jigsaw puzzle pieces to convey a powerful metaphor for linking information and observations. He has used groups of people assembling jigsaw puzzles to conduct experiments that reveal important insights about the way humans’ analytical thinking enables them to link pieces together to make a picture, just as analysts want to link disparate observations together to form a cohesive picture of an intelligence threat, to find a perpetrator after a bombing, or even to learn enough about you to make you offers that you just can't refuse. Jeff's presentation took place the week before the Boston Marathon bombing, and when that happened I wondered what role NORA’s descendants might have played in analyzing video feeds and finding the bombers.

Unfortunately, Jeff sees many organizations getting dumber about their data – the algorithms they have developed to help them make sense of their data are not growing and innovating fast enough to keep up with the flood of new data from new sources, such as location data, which is a potential source of deep new insights. He calls this gap “enterprise amnesia,” and told the story of retailers that have been known to hire associates who were previously arrested for shoplifting – from the same store location!

Lessons Jeff Learned From A Lifetime Of Linking Data Together

  • Data is often imperfect – and that’s usually a good thing! You don’t need perfect information to find interesting relationships in the data – in fact, counter-intuitively, “dirty” data is sometimes better for finding relationships, because cleansing may remove the very attribute that enables matching. On the other hand, some information is a lie, as “bad guys” will intentionally try to fool you, or to separate their interactions with your firm into different channels (web, mobile, store) to avoid detection. You should assign a trust level to “known” information, and it rarely approaches 100%.
  • Your data can make you smarter as time passes. As new observations continue to accumulate, they enable you to refine your understanding, and even to reverse earlier assertions of your analysis based on what you knew at the time. Therefore, be sure to rerun earlier analyses over the full dataset, and don’t assume the conclusions of your previous analysis were correct.
  • Partial information is often enough. It’s surprising how soon you can start to see a picture emerge – with puzzles, the picture can often be identified with only 50% of the pieces, and this aspect of human cognition often applies to machine learning, too. Once the picture starts to emerge, you can more quickly understand each new puzzle piece (observation) by seeing it in the context it occupies among those around it.

This emerging picture should inform your collection efforts – you might need to obtain a new information source to follow up a lead from an earlier analysis, or to discard an information source (and the cost of collecting and analyzing it) once you realize it’s not helping.

  • More data is always good. The case for accumulating more data – Big Data – is strong: not only does it bring deeper insights, it also can reduce your compute workload – Jeff’s experience shows that the length of time it takes to link a new observation into a large information network actually goes down as the total number of observations goes up, beyond a certain threshold.

One of the most interesting new sources of Big Data insights is data about the interactions of people with systems – even their mistakes! That’s how Google knows to ask “did you mean this?”

  • Can you count? Good! Accurate counting of entities (people, cases, transactions), a.k.a. Entity Resolution, is critical to deeper analysis – if you can’t count, you can’t determine a vector or velocity, and without those, you can’t make predictions. Many interesting analyses in fraud detection involve detecting identities – accurately counting people, knowing when two identities are the same person, or when one identity is actually being used by more than one person, or even when an identity is not a real person at all… Identity matching is also the source of analyses that identify dead people voting and other such fraud.
  • Privacy matters, but it’s not an obstacle. Once identity comes into play, then privacy concerns (and regulations) must of course be taken into consideration. There are advanced techniques such as one-way hashes that can be used to anonymize a dataset without reducing its usefulness for analytical purposes.
  • Bad guys can be smart, too. Skilled adversaries present unique problems, but they can be overcome: to catch them, you must collect observations the adversary doesn’t know you have (e.g. from a camera on a route they don’t know is being watched), or compute over your observations in a way the adversary can’t imagine (e.g. recognizing faces or license plates, and correlating that with other location information).

So as adversaries get smarter and more capable of avoiding detection all the time, savvy analysts must keep pushing the envelope, applying new techniques and technology to the game.
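The one-way hash technique mentioned in the privacy lesson above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key `SHARED_KEY`, the helper `pseudonymize`, and the e-mail addresses are hypothetical, and a keyed hash (HMAC-SHA256 from the standard library) is my choice for the example rather than anything Jeff specified.

```python
import hashlib
import hmac

# Hypothetical key that both data holders agree on out of band.
# In practice it would be kept secret; it is hard-coded here for illustration.
SHARED_KEY = b"example-key-agreed-out-of-band"


def pseudonymize(value, key=SHARED_KEY):
    """One-way keyed hash of an identifier, after light normalization."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Each party hashes its own identifiers before sharing anything.
list_a = {pseudonymize(e) for e in ["alice@example.com", "bob@example.com"]}
list_b = {pseudonymize(e) for e in ["Bob@Example.com ", "carol@example.com"]}

# Matching works on the hashed values; raw identifiers are never exchanged.
overlap = list_a & list_b
print(len(overlap), "identity appears on both lists")
```

The two lists can still be matched because identical inputs produce identical digests, while the digests themselves do not reveal the underlying identifiers to anyone without the key. Normalizing before hashing matters: once a value is hashed, fuzzy comparison is no longer possible.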

How To Stay Ahead Of The Game

Jeff pointed out that location data presents tantalizing new possibilities for insight. There are 600 billion location records created every day in the US alone! This data is being routinely de-identified and shared with multiple third parties, in volume and in real time, and it’s amazing what you can figure out from it. Consider the example of Malte Spitz, who as an act of political protest over his privacy concerns sued Deutsche Telekom for release of his location records. They revealed that over six months, he “hung out” 2,400 times at 130 unique places. Know three of those locations – home (sleeps at night), work (goes in the daytime), and pub (goes to meet friends, which links to other trails of location data) – and I can tell you who the person is, despite the anonymized data, and who his friends are.
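A toy Python sketch shows why three known places can be enough to single out a de-identified trail. Everything here is invented for illustration: the device IDs, place names, the `(hour, place)` record shape, and the helpers `candidates` and `co_located` are assumptions, not a description of any real dataset or of Jeff's method.

```python
# De-identified trails: device ID -> list of (hour_of_day, place) observations.
trails = {
    "device-17": [(2, "elm_st_42"), (10, "office_park_b"), (20, "corner_pub")],
    "device-23": [(3, "oak_ave_9"), (11, "office_park_b"), (21, "city_gym")],
    "device-31": [(1, "pine_rd_5"), (9, "harbor_mall"), (20, "corner_pub")],
}


def candidates(trails, known_places):
    """Device IDs whose trail includes every one of the known places."""
    known = set(known_places)
    return [
        device
        for device, visits in trails.items()
        if known <= {place for _, place in visits}
    ]


def co_located(trails, target, place):
    """Other devices seen at `place` during the same hours as `target`."""
    target_hours = {h for h, p in trails[target] if p == place}
    return sorted(
        device
        for device, visits in trails.items()
        if device != target
        and any(p == place and h in target_hours for h, p in visits)
    )


# Knowing someone's home, work, and pub narrows the trails to one device.
matches = candidates(trails, ["elm_st_42", "office_park_b", "corner_pub"])
friends = co_located(trails, matches[0], "corner_pub")
print(matches, friends)
```

Real trails are far noisier and larger, so an actual analysis would need fuzzy matching on time and place, but the set logic is the same: each additional known location shrinks the candidate set, and shared time and place link one trail to others.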

Although there’s a strong trend toward analyzing data in memory and delivering insights in real time – to inform “sense and respond” systems – don’t imagine that the world is going all real-time. Instead, Jeff advises viewing batch approaches to analysis as an important complement: they provide “periods of reflection” that yield insights you can then use to improve the accuracy and usefulness of the model that drives your “sense and respond” systems. Jeff labels these two sides of the analytical world with catch phrases: “sense and respond” (relevance finds the sensor) vs. “explore and reflect” (relevance finds you). Jeff advises we use both sides together, which should inform future architectures for doing advanced analytics.

In contrast, today we do analytics in stovepipes – we have one set of algorithms to analyze structured data, different algorithms for unstructured data, and still more (different) algorithms for social data! Jeff believes that in the future we must take a more integrated approach to analytics, with algorithms that reason over datasets that mix all types of data, and link them all. It’s only through this broader view that we can do what casinos do, and catch the bad guys while they are still playing Blackjack.

What This All Means For You

Below find my take on how you should act upon Jeff Jonas’ insights, but I also urge you to engage with Forrester’s analysts who spend every waking moment thinking about business intelligence, Big Data, and the potential for deeper business insights that these and other innovations can bring: Boris Evelson, Martha Bennett, Mike Gualtieri, Noel Yuhanna, Michele Goetz, Brian Hopkins, and others. In my view:

  • Integrate your analytics stovepipes. Gaining deep insight requires a more integrated approach to analytics, bringing together all sources of information, whether structured, semi-structured, or unstructured (including media) into one pool of observations for analysis. This runs counter to the stovepiped approach to analytics that many organizations practice today, so it will require a major upheaval to accomplish, but it will be worth it for those that most require this kind of intelligence. The implications affect organization, staffing/skills, choices of technology, and architecture.
  • Integrate real-time and batch analytics for deeper insight. Both real-time and batch approaches are critical, and are also more complementary than many people realize. Although the need to act quickly on information that develops in real time (sense and respond) is the primary driver of the need to increase investments in real-time, the opportunity to inform batch analyses/models with new insights that are constantly emerging from real-time channels is an under-recognized source of added value that can help support the business case for real-time, just as insights from “deep reflection” via batch methods can inform and improve “sense and respond.”
  • Don’t be afraid of real-time. I was struck by Jeff's view that real-time may not cost more, as many expect it does. My own research, talking to people who are doing new work in-memory and using new technology like SQLStream or Streambase, or CEP, suggests that Jeff is right, that these innovative new ways of gaining insight often develop those insights much more efficiently than through other approaches that require cranking through the whole haystack, instead of reaching in and picking out just the needle you care about.
  • You need the right people to gain these insights. Transforming your approach to analytics will depend mainly on having the right people – as Jeff put it, you should hire “curious” people. In the future it will be more important for an analyst to be curious, even driven, than for the analyst to know SQL. These curious people will be seeing the emerging picture uncovered as data finds data – algorithms discover connections among many different observations – and using those insights to continually refine their analytical models and augment their sources with additional observations.
  • Beware the privacy and regulatory implications of integrating analytics. The value of combining information from multiple sources will motivate organizations that urgently require better insights from this data to consider how to obtain insights from the datasets they need without violating policies and regulations designed to protect the interests of citizens, while staying away from the legal jeopardy of a “fishing expedition.”

This opportunity to integrate multiple sources of insight is too important to our business success, good governance, and security, to let it go by. Be sure you enhance your strategy for analytics and business intelligence to exploit the opportunities that Jeff Jonas’ research and innovation shows us are real and compelling.

Comments

Nice work

Elegant piece of work here for a blog post -- thanks to @bevelson for the link or I may have missed it, despite having followed Jeff's blog for quite a while.

The only item I can see to take any issue at all with that may present itself to me and not others is a bit of caution on the imperfect data assumption, and more is better, when combined with the necessity of curious people. Both are accurate, but they are also overdone in cultures that are attempting to protect the cash flow that has become data scientists.

While it's true that processing power, improved algorithms and unlimited data have really brought the cost down-- and quality up-- Google being a fine example, I hope no one is suggesting that the need for verification or data quality has evaporated.

It may not be as satisfying for the curious or as expensive for the customer, but most intelligence agencies, banks, and others would rather have a high degree of confidence of who they are dealing with, even if the curious within us should question it continuously with various methods. So I for one would like to see a bit more investment on quality upfront to make less sophisticated guesswork and investment necessary, especially since the vast majority of organizations lack the budget of a large Vegas casino working to curb large, direct losses due to fraud.

As mentioned, however, that is literally the only item I can find on this Friday afternoon. Thanks for sharing--good stuff. - MM

Re: imperfect data

Thanks for your feedback, Mark, and I think you have a good point, with which I certainly agree. But I think there is (or should be) a distinction between the level of quality expectation we have of data in different areas of the information management architecture.

So at the most raw level, data flows from its original sources into a "pool" (or "lake" - see Brian Hopkins' work on "Hub and Spoke" architectures for Big Data) where we just take it all as-is, and revel in the weirdness and completeness of it. For a big firm that delivers a lot of healthcare, this might be the collection of all electronic medical records as originally captured, including doctors' handwritten notes, etc.

Then as we operate upon that data, we apply varying levels of structure and semantics, with corresponding differing levels of massaging for quality. At the lower level we want to know everything, at a higher level we may only want to know about stuff we can file an insurance claim about. And we may want that data formatted in a particular way which is a bit different from the way doctors put it in, but which conforms to what the insurance company (or Medicare) requires.

I think in the olden days we just had the "good" data and had to throw the rest away because we couldn't afford to keep it. Whereas now we can keep everything. And whereas when the insurance claim goes in, the insurance company has a very circumscribed level of interest in what actually happened in the patient's interaction with the doctor, another part of the insurance company may be very interested in doing predictive analytics on a broader cross-section of data from the "lake," with a much looser structure and set of assumptions about quality, in order to gain a better understanding of the efficacy of different medical practices.