The Real Reason Why Google Flu Trends Got Big Data Analytics So Wrong

Unless you have just returned to Earth after a short break on Mars, you will have noted that some of the shine has come off the big data bandwagon lately.

Two academic papers that may have escaped your attention can help us to understand why - but also demonstrate that the naysayers are as misguided in their cynicism as the zealots are in their naïvety.

Google Flu Trends (GFT) was once held up as the prototypical example of the power of big data. By leveraging search term data – apparently worthless “data exhaust” – a group of Data Scientists with little relevant expertise were able to predict the spread of flu across the continental United States. In near real-time. At a marginal cost. And more accurately than the “experts” at the Centers for Disease Control and Prevention with their models built from expensive survey data, available only after the fact.

Except that they weren't.

We now know that GFT systematically over-estimated cases - and was likely predicting winter, not flu. The first paper attempts to be even-handed and magnanimous in its analysis of what went wrong – and even succeeds, for the most part - but the label that the authors give to one of the mistakes made by the Google team (“Big Data Hubris”) rather gives the game away.

If revenge is a dish best served cold, then perhaps the statisticians and social scientists can be forgiven their moment of schadenfreude at the expense of the geeks who dared to try and steal their collective lunch.

Revenge aside, this matters. Because it goes to the heart of a debate about how we should go about the business of extracting insight and understanding from big data.

Traditional approaches to analytics – what you might call the “correlation is not causality” school - have emphasised the importance of rigorous statistical method and understanding of the problem space.

By contrast, some of what we might characterise as the “unreasonable effectiveness of data” crowd have gone so far as to claim that understanding is over-rated – and that with a big enough bucket of data, there is no question that they can’t answer, even if it is only “what” that is known, not “why”.

All of which is what makes Lynn Wu and Erik Brynjolfsson’s 2013 revision of a paper they first wrote in 2009 so important. Wu and Brynjolfsson also set themselves the task of leveraging search term data – this time to predict U.S. house prices – but instead of discarding the pre-existing transaction data, they used the data exhaust to create new features to enhance an existing model.

This is big data as extend-and-enhance, not rip-and-replace. And it works - Wu and Brynjolfsson succeeded in building a predictive model for real estate pricing that out-performed the experts of the National Association of Realtors by a wide margin.
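
To make the extend-and-enhance idea concrete, here is a minimal sketch in Python. It is not the authors' actual model or data: the series are synthetic, and every name in it (past_sales, search_index, fit_ols and so on) is a hypothetical stand-in. It fits a baseline model on a traditional feature alone, then the same model with one extra feature derived from search data, and compares the two on held-out rows.

```python
import random

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (pure Python)."""
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def sse(rows, targets, coefs):
    """Sum of squared prediction errors."""
    return sum((y - sum(c * x for c, x in zip(coefs, r))) ** 2
               for r, y in zip(rows, targets))

# Synthetic data (an assumption for illustration): the outcome depends on
# both a traditional transaction feature and a search-derived feature.
rng = random.Random(0)
n = 200
past_sales = [rng.gauss(100, 10) for _ in range(n)]
search_index = [rng.gauss(50, 5) for _ in range(n)]
future_sales = [0.8 * p + 1.5 * s + rng.gauss(0, 3)
                for p, s in zip(past_sales, search_index)]

# Baseline: intercept + traditional feature. Augmented: adds the new feature.
baseline_rows = [[1.0, p] for p in past_sales]
augmented_rows = [[1.0, p, s] for p, s in zip(past_sales, search_index)]

# Fit on the first 150 rows, evaluate on the remaining 50.
split = 150
baseline_coefs = fit_ols(baseline_rows[:split], future_sales[:split])
augmented_coefs = fit_ols(augmented_rows[:split], future_sales[:split])

baseline_sse = sse(baseline_rows[split:], future_sales[split:], baseline_coefs)
augmented_sse = sse(augmented_rows[split:], future_sales[split:], augmented_coefs)

print("baseline hold-out SSE: ", round(baseline_sse, 1))
print("augmented hold-out SSE:", round(augmented_sse, 1))
```

The augmented model wins here only because the synthetic outcome was built to depend on the search feature. With real data, the hold-out comparison is what tells you whether the data exhaust is actually adding signal.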

All of which might sound interesting, but also a little worthy and academic. What can we learn from all of this about the business of extracting insight and understanding from data in business?

Plenty. If you are a bank that wants to build a propensity-to-buy model to understand which products and services to offer to digital natives, then leverage clickstream data. But use it to extend a traditional recency / frequency / spend / demography-based model, not replace it.

If you are an equipment maker seeking to predict device failure using “Internet of Things” sensor data that describe current operating conditions and are streamed in near real-time, you can bet that a model that also accounts for equipment maintenance and manufacture data will out-perform one that does not.

And if you are leading a big data initiative, you should prioritise integrating any new technologies that you deploy to build a Data Lake with your existing Data Warehouse, so that you can connect your “transaction” data with your “interaction” data.
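
As a rough illustration of what connecting “transaction” and “interaction” data means in practice, the sketch below joins two small in-memory record sets on a shared customer key. The records, field names and the build_customer_view helper are all hypothetical; in a real deployment this would be a join between warehouse tables and data lake files, not Python lists.

```python
from collections import defaultdict

# Hypothetical "transaction" records, as a Data Warehouse might hold them.
transactions = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c1", "amount": 80.0},
    {"customer_id": "c2", "amount": 40.0},
]

# Hypothetical "interaction" records, as clickstream in a Data Lake.
interactions = [
    {"customer_id": "c1", "page": "mortgage-calculator"},
    {"customer_id": "c2", "page": "savings-rates"},
    {"customer_id": "c2", "page": "mortgage-calculator"},
    {"customer_id": "c3", "page": "home"},
]

def build_customer_view(transactions, interactions):
    """Combine both sources into one feature row per customer."""
    view = defaultdict(
        lambda: {"total_spend": 0.0, "purchases": 0, "page_views": 0}
    )
    for t in transactions:
        row = view[t["customer_id"]]
        row["total_spend"] += t["amount"]
        row["purchases"] += 1
    for i in interactions:
        view[i["customer_id"]]["page_views"] += 1
    return dict(view)

customer_view = build_customer_view(transactions, interactions)
for customer_id, features in sorted(customer_view.items()):
    print(customer_id, features)
```

Customers who appear in only one source still get a row, with zeros for the missing side, so a downstream model sees both kinds of feature together.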

Because if we are not to make the same mistakes as Google Flu Trends, then we need to face up to the fact that big data is about “both and”, not “either / or”.