Predicting a landslide
"We predict The Tories will win by a landslide"
"Lies, damned lies and statistics"
Now, before I start, this is not a political article. No one's left or right leanings will be discussed, and no politicians will be hurt in the making of this article.
With that disclaimer out of the way, let's talk 'predictive analytics' and the issues that can arise as you use these techniques.
Whether you like a surprise or not, we all want to know what's coming next in some form or another. Whether it's knowing if it will rain in the morning or what the lottery numbers will be in the next draw, the future is what we want to know about.
In 'data' there are two ways to get to this: forecasting and prediction. Though some argue the difference is nearly semantic, the two terms are often used quite distinctly in the organisations we speak to. For most, forecasting means using time series (historical data) to plan at an aggregate level what will come next (think sales planning based on previous years' data). Prediction, on the other hand, makes use of all of the data available to anticipate what individuals will do, based on that data.
Now, whichever you use, there are a couple of pitfalls you need to avoid.
Is "all of the data" actually "all of the data"?
One of the first things you need to do to make sure your predictive model or forecast is working properly is understand your dataset. Without this understanding, you will quite often get the wrong outcomes from your analyses.
For example, predicting elections, or anything else, from qualitative data is a minefield. Firstly, is it all of the data, or just part of the data? With the election coverage, one of the things I find most interesting about the pre-election polling is the small print, which is often not discussed. Within it, there is always one all-important little word: 'base'. The base is the sample the polling company spoke to when conducting the poll, and it can be all-important for this type of activity.
Obviously, the only way to truly predict exactly how an election will go is to ask everybody (and then hope none of them lie or change their mind, but more on that later). Equally obviously, this isn't feasible. As a result, pollsters ask a sample of people, and then extrapolate out from there. For example, the last Ipsos Mori poll released prior to the election interviewed:
"a representative sample of 1,291 adults aged 18+ across Great Britain. Interviews were conducted by telephone 6th June – 7th June 2017. Data are weighted to the profile of the population (by age, gender, region, work status/sector, social grade, car in household, child in household, tenure, education (updated) and newspaper readership), and voting intention figures are based on all registered, and an adjustment for turnout overclaim based on age, and now also including tenure. As in all our final polls in recent general elections, we have reallocated refusals to the voting intention question on the basis of their newspaper readership. "
So, in effect, they interviewed around 1,300 people from a population of 65 million? This poll gave a vote share of 44% to the Conservatives (who actually got 42%) and 36% to Labour (who actually got 40%), giving the Conservatives a majority.
As the (lengthy) quote above illustrates, when using a sample you need to control a number of factors to make it representative of the whole population, otherwise you risk skewing your data and arriving at false findings. More respondents to a survey or panel make this easier, which is why the exit poll, which surveyed 20,000 people, was nearly spot on. It is also how tools like comScore are able to be quite accurate with their data (they have roughly 2 million panelists across versions of their tools).
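As a rough sketch of why sample size matters so much, the textbook margin-of-error formula for a proportion from a simple random sample is z·√(p(1−p)/n). Real polls layer weighting and design effects on top of this (which is why Ipsos Mori quote a wider figure), so treat the numbers below as illustrative only:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample of size n.
    p=0.5 is the worst case; real polls also adjust for weighting effects."""
    return z * math.sqrt(p * (1 - p) / n)

# Pre-election poll: ~1,291 respondents
print(round(margin_of_error(1291) * 100, 1))   # ~2.7 percentage points
# Exit poll: ~20,000 respondents
print(round(margin_of_error(20000) * 100, 1))  # ~0.7 percentage points
```

A sample fifteen times larger shrinks the theoretical margin from roughly 2.7 to 0.7 percentage points, which goes some way to explaining the exit poll's accuracy.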
Did they actually get it wrong?
One of the other features that predictions and forecasts should have (the good ones, anyway) is a view of the margin of error. This is the amount by which the forecast or prediction could be out, given the sample size and the pollster's ability to control all of the variables in the data to prevent skew.
In the Ipsos Mori poll I quoted from earlier, there is one other small print comment, buried at the end of the sentence, which says:
"error bars a 4ppt margin of error"
This one statement says, 'our forecast could be wrong, and it could be up to 4 percentage points out either way'. So although their forecast puts the Conservatives on 44%, the true figure could be as low as 40% or as high as 48%. For Labour, the range is 32% to 40%. With this being the case, the outlying points on the forecast were actually right, and the result was within the margin of error that their model provided.
Now, there are many statistical ways to reduce the margin of error (adjusting confidence levels and so on), which I won't get into, but the easiest way is, again, to increase the size of your sample.
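Rearranging the same worst-case formula gives the simple-random-sample size you'd need for a target margin of error. Again, this is a sketch that ignores weighting and non-response, so real polls need more respondents than this:

```python
import math

def sample_size_for(moe, p=0.5, z=1.96):
    """Smallest simple random sample giving the target margin of error.
    moe is expressed as a fraction, e.g. 0.04 for 4 percentage points."""
    return math.ceil(z**2 * p * (1 - p) / moe**2)

print(sample_size_for(0.04))  # 601 respondents for a 4ppt margin
print(sample_size_for(0.02))  # 2401 for 2ppt
print(sample_size_for(0.01))  # 9604 for 1ppt
```

Note the catch: halving the margin of error quadruples the required sample, which is why tight polls get expensive quickly.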
"Tell me lies, tell me sweet little lies"
One last point on qualitative data: whether people mean to or not, they often 'lie', whether by actively misleading in their responses or by genuinely perceiving something as true that isn't. For pollsters, this poses a massive issue which can be hard to manage. For businesses, though, it can easily be counteracted by comparing your qualitative data to quantitative data, like web analytics.