Big Data: Lessons Learned from Google Flu Trends
Google Flu Trends was quietly retired in 2015. Much has been written about the pitfalls of ‘Big Data Hubris’: the implied assumption that ever larger and more complicated datasets are an improvement on more traditional data analysis skills. That assumption has some clear problems. The first lies in the distance that big data imposes between the analyst and the act of data collection, and between the analyst and the behaviour that he or she is attempting to analyse.
The greater the distance between you and your data, the more evidence falls through the cracks.
Correlation is not Causation
Every analyst worth their salt knows the golden rule: correlation is not causation. A quick browse of Tyler Vigen’s very entertaining website reveals that the age of Miss America winners correlates with murders by steam or hot vapour, and that the divorce rate in Maine matches up with the per capita consumption of margarine.
But the sheer volume of Google’s data distracted analysts from this well-established fact. It was assumed that the vast quantities of data involved, and the fact that human behavioural information on this scale had never been available before, would transform the nature of the analysis: that it would change the rules and grant us the ability to model the near future like never before.
Size Doesn’t Matter (the same rules still apply)
But problems inherent in traditional small data analysis do not simply disappear when you scale the database up. If we drop all the talk of big data, algorithms and warehouses full of servers, the actual analysis that Google ran is very easy to describe.
Google took a load of search terms (the fact that they used 50 million terms and not a hundred doesn’t drastically change things) and isolated those that correlated with historic rates of doctor appointments for flu-like symptoms. When that many candidate terms are tested against a comparatively short historical record, some of them will match the flu curve by chance alone.
It is reported that Google had to weed out many terms that correlated but were obviously spurious (high-school basketball related terms, for example) in their effort to make the model into a functioning “flu-detector”. This should have rung alarm bells.
And perhaps it did. The fact that Google never released the list of actual search terms that formed the basis of its model suggests that Google knew that these specifics would be open to criticism.
Was Google aware of the problem but refusing to address it? Did they think themselves safe in the knowledge that the sheer volume of data with which they were working would drown out any cause for concern?
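The term-selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's actual method: the "flu" series and every candidate "search term" are pure random noise, and the names `pearson` and `best_spurious_term`, along with all the sizes, are assumptions made up for the example. Even so, picking the best of a few thousand candidates against a short history yields a term that appears to track flu closely, and the apparent link tends to vanish on weeks that were not used for the selection.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_term(n_terms, n_weeks, seed=0):
    """Pick the noise 'term' that best matches a noise 'flu' series.

    Returns (in-sample r, out-of-sample r) for the selected term.
    Everything here is random, so no real relationship exists.
    """
    rng = random.Random(seed)
    # Hypothetical flu series: first half for selection, second half held out.
    flu = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
    train_flu, test_flu = flu[:n_weeks], flu[n_weeks:]

    best = None
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
        r_train = pearson(term[:n_weeks], train_flu)
        if best is None or abs(r_train) > abs(best[0]):
            # Remember how the selected term does on the held-out weeks.
            best = (r_train, pearson(term[n_weeks:], test_flu))
    return best


r_train, r_test = best_spurious_term(n_terms=5000, n_weeks=20)
print(f"in-sample r = {r_train:.2f}, out-of-sample r = {r_test:.2f}")
```

With settings like these, the winning term usually shows a strong in-sample correlation, while its correlation over the held-out weeks is no better than chance would suggest. Scaling the number of candidate terms up makes the in-sample match look better, not the model more trustworthy.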
The Big System Breaks Down
After its initial success and much media coverage, the model completely missed an unseasonal influenza outbreak in 2009, and missed again, by a large margin, in the 2011-12 flu season.
It’s not just the tried and tested analytical methodologies that are potentially obscured by big data. The system was not thrown off by media-fuelled panic over bird and swine flu in the late 2000s, so clearly its list of search terms wasn’t the only reason for its downfall.
In fact, the problem may have arisen at the point of data collection. Harvard researchers have claimed that a likely explanation for Google Flu Trends’ failure was the ever-changing nature of Google’s own search algorithm.
For example, supplementary suggested search terms were added in 2011, which arguably muddied the waters. Changing what the search engine suggested may have skewed the data, since users making more general searches related to the common cold began to be shown suggestions for flu treatments. This is a plausible source of the erroneously high estimates of flu that plagued the system in its later years.
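The effect of a change at the point of data collection can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a reconstruction of Google Flu Trends: the visit counts, the fixed ratio of ten searches per visit, and the 40% inflation caused by search suggestions are all hypothetical numbers, and `fit_scale` is a made-up helper. A model calibrated before the change overestimates every week afterwards, even though the underlying level of illness is unchanged.

```python
def fit_scale(searches, visits):
    """Least-squares fit through the origin: visits is roughly k * searches."""
    num = sum(s * v for s, v in zip(searches, visits))
    den = sum(s * s for s in searches)
    return num / den


# Hypothetical weekly doctor visits for flu-like symptoms.
true_visits = [100, 150, 300, 500, 300, 150]

# Assumption: before the change, each visit goes with ten flu searches.
searches_before = [10 * v for v in true_visits]
k = fit_scale(searches_before, true_visits)

# Assumption: suggested searches inflate flu query volume by 40%
# while the number of people who are actually ill stays the same.
searches_after = [1.4 * s for s in searches_before]
estimates = [k * s for s in searches_after]

for visits, estimate in zip(true_visits, estimates):
    print(f"actual {visits:4d}   estimated {estimate:7.1f}")
```

Nothing in the model is wrong in the usual statistical sense here. The relationship it learned simply stopped holding once the way the data was generated had changed, which is why consistent data collection matters as much as the analysis itself.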
What Have We Learnt?
Big is not necessarily better. Big data will not benefit you one bit if your data collection is not consistent and robust. The size of your dataset doesn’t change the game. ‘Big data’ isn’t going to magically paper over the methodological cracks that you’d encounter in ‘small data’. Big or small, the same analytical rulebook still applies.
For further reading on this topic, see Harvard’s The Parable of Google Flu: Traps in Big Data Analysis.