Polling Meets Machine Learning
This post is going to be a little different from my usual ones. I have some thoughts about polling methodology that have been brewing for quite a while now.
The current state of polling is that polling companies conduct interviews, record answers to the questions asked, and then turn those answers into point estimates of the proportion of voters who favor a certain candidate or issue across the whole area in question - a state or the entire country. This normally involves some sort of weighting to make the sample that was polled more representative of whatever population the poll wishes to generalize to.
What each polling company is implicitly doing is building a model to predict the proportion of voters who will vote for a candidate or an issue on Election Day. This is the right strategy to follow, because we observe some sample of intentions to vote one way or another and we wish to predict how an unseen population of voters will vote. What I am getting at here is that this is just a machine-learning problem. With each poll, attributes about each participant are recorded, along with an outcome: X and y.
After building a model, we wish to make a prediction given a test set X-test: the attributes of the voters who will actually vote on Election Day. Knowing precisely what X-test looks like before Election Day (what are the demographics of the people who will really turn out?) is its own problem, which can be handled with a combination of good data and statistics.
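To make the framing concrete, here is a minimal sketch of what X, y, and X-test could look like in code. The feature names, the toy numbers, and the choice of logistic regression are purely illustrative assumptions on my part, not any pollster's actual pipeline.

```python
# Minimal sketch of the polling-as-supervised-learning framing.
# Features, values, and model choice are illustrative assumptions only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# X: attributes recorded for each poll respondent; y: stated vote intent.
poll = pd.DataFrame({
    "age":       [22, 67, 45, 31, 58],
    "income":    [35_000, 80_000, 55_000, 42_000, 95_000],
    "is_urban":  [1, 0, 1, 1, 0],
    "votes_for": [1, 0, 1, 1, 0],   # outcome: intends to vote for the candidate
})
X, y = poll.drop(columns="votes_for"), poll["votes_for"]

model = LogisticRegression().fit(X, y)

# X-test: the (estimated) attributes of the people who will actually turn out.
X_test = pd.DataFrame({"age": [29, 71], "income": [48_000, 60_000], "is_urban": [1, 0]})
print(model.predict_proba(X_test)[:, 1])  # predicted probability each voter favors the candidate
```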
In this way, making a model to predict how each voter will vote is just like any other machine-learning problem. It is made more complicated by the fact that the observations have a time-dependent aspect and that the prediction errors might be correlated with one another. FiveThirtyEight realizes this and bakes the potential for correlated errors into its simulations.
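To see why correlated errors matter so much, here is a toy simulation of the mechanism. Every number in it (the two-point lead, the three-point error, the correlation strength) is an invented assumption; the point is only that a shared error component makes an across-the-board miss far more likely than independent errors would.

```python
# Toy illustration of correlated vs. independent polling errors.
# All margins, error sizes, and correlations are invented for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_sims = 10, 100_000
true_margin = np.full(n_states, 0.02)   # candidate leads by 2 points in every state
sigma = 0.03                            # per-state polling error (standard deviation)

def simulate(rho):
    # Equicorrelated errors: every pair of states shares correlation rho.
    cov = sigma**2 * (rho * np.ones((n_states, n_states)) + (1 - rho) * np.eye(n_states))
    errors = rng.multivariate_normal(np.zeros(n_states), cov, size=n_sims)
    outcomes = true_margin + errors
    # Fraction of simulations in which the candidate loses a majority of states.
    return np.mean((outcomes > 0).sum(axis=1) < n_states / 2)

print("independent errors:", simulate(rho=0.0))
print("correlated errors :", simulate(rho=0.7))
```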
I think we need to take a fresh look at how we build these models. The real problem is that we are asking too much of polling companies. While some newer online polls can get 3,000 or 4,000 participants in a given state, the sample size of any one poll pales in comparison to the millions of people who will vote on Election Day. No amount of adjusting will help with the fact that the model does not have enough data to be an unbiased estimator. Small sample sizes, combined with the fact that any one poll's data is most likely skewed (landline polls favor older people, online polls favor younger people), make the job even more difficult.
The recent exposé on the methodology of the USC/LA Times poll highlights the issue. This is a national poll that tracks 3,000 people. First of all, any statistician or machine-learning practitioner confronted with a complex real-world phenomenon that has hundreds if not thousands of relevant predictors, and told that there are only 3,000 training examples, will tell you to collect more data. How people split over political issues is extremely complex, and everything from region, age, income bracket, religion, and gender to the interactions between all of these variables is crucial for producing an unbiased model. The LA Times poll correctly has this intuition. That's why it wanted to carve its already small sample into fine categories like 18-21-year-old men and use only that data to extrapolate about the broader population of this age group.
The problem, of course, is that there might be only 15 or so participants in the survey who are 18-21-year-old men, so using these fine-grained categories produces an estimator with high variance. It is like using a nearest-neighbor algorithm with too few neighbors.
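A quick simulation makes that variance concrete. The assumed 60% true support level is arbitrary; what matters is how wildly an estimate from 15 respondents swings compared with one from a few thousand.

```python
# How noisy is a proportion estimated from a bucket of only 15 respondents?
# The true support level of 60% is an arbitrary assumption for the sketch.
import numpy as np

rng = np.random.default_rng(0)
true_support = 0.60

for n in (15, 500, 3000):
    # Simulate many polls of size n and look at the spread of the estimates.
    estimates = rng.binomial(n, true_support, size=10_000) / n
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    print(f"n={n:5d}  std of estimate = {estimates.std():.3f}  "
          f"middle 95% of estimates = [{lo:.2f}, {hi:.2f}]")
```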
But bickering about the proper age buckets to divide the sample into misses the larger point. Polling companies are implicitly building estimators of the larger population, but they are doing so without testing. A crucial step in any model-building effort is holding out data to see how the model generalizes to unseen observations: hold out some of your sample and, using some sort of cross-validation strategy, build the model that predicts best on the held-out portion of the data.
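In scikit-learn terms, that hold-out-and-validate step might look something like the sketch below. The synthetic data, features, and choice of a random forest are placeholder assumptions, not a recommendation for how a real polling model should be specified.

```python
# Sketch of held-out evaluation for a poll-based model.
# Data, features, and model are placeholders, not any pollster's real setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 6))                                          # 3,000 respondents, 6 features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=3000)) > 0               # stated vote intent

# Keep a final hold-out set untouched until the very end.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)              # 5-fold cross-validation
print("cross-validated accuracy:", round(cv_scores.mean(), 3))

model.fit(X_train, y_train)
print("hold-out accuracy:", round(model.score(X_holdout, y_holdout), 3))
```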
When one approaches this as an explicit machine-learning problem, the current method of polling raises the question: why is every polling company that conducts a poll responsible for building its own estimator? Wouldn't a model that has access to all the raw data observed across all the polls in the country be much more powerful?
Now, you might be thinking: why is taking an average of the polls' estimates not just as good as building a model with access to all the data? Isn't the ensembling of models a hot topic in machine learning because it produces such accurate predictions? Isn't Nate Silver's methodology of aggregating polls, weighting them according to their methodological prowess, good?
What FiveThirtyEight has done for polling is great. But it can do better. Here's what I mean. At a high level, poll aggregators are acting like one big ensemble model, averaging the estimates of many weak estimators to produce better predictions. The trouble is that each weak estimator is built with only a tiny fraction of the overall data. With so little data, distinguishing between an African American 24-year-old in Michigan and an African American 35-year-old in California is much trickier and might be hopeless.
A concrete example: Gradient Boosting Machines are all the rage nowadays in statistics and machine learning. The idea is to train successive weak learners on the mistakes made by the previous learners. It is a truly state-of-the-art algorithm. Gradient Boosting Machines have many hyperparameters to tune that are often crucial to the model's performance, one of which is usually called the subsample rate. This is the proportion of the data, randomly drawn without replacement from the dataset, that is available for each learner to train on. The idea is that if there is some variety in the data each learner sees, the learners end up less correlated with each other, and therefore the overall model has less variance. It is a form of regularization.
This value is usually tuned using cross-validation, and the typical optimal values found are between 0.5 and 1, meaning that each weak learner has between 50% and 100% of the data available to train on. With the way ensembling of estimators is currently done in polling, this parameter is effectively being set to a tiny fraction - something like 0.005, or 0.5%. The problem is that we are taking on far too much bias in exchange for the reduction in variance.
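Here is roughly what tuning that subsample rate by cross-validation looks like, using scikit-learn's GradientBoostingClassifier on synthetic data. The grid values and model settings are illustrative assumptions; the 0.005 entry is there only to mimic the share of the overall electorate's data that a single poll represents.

```python
# Tuning the subsample rate of a gradient boosting model by cross-validation.
# Synthetic data; the grid and model settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    param_grid={"subsample": [0.005, 0.25, 0.5, 0.75, 1.0]},  # 0.005 mimics one poll's share of the data
    cv=5,
)
grid.fit(X, y)
print("best subsample rate     :", grid.best_params_["subsample"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```

If the cross-validation keeps rejecting the tiny subsample values, that is the same bias-variance argument as above, just measured instead of asserted.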
If a model had access to enough data, there might be all sorts of real and interesting interactions between region, religion, race, and income that could be learned. That would allow the model to have less bias than the current standard allows for today. More and more polls already release their raw numbers to the public. Still, there is probably more incentive to conduct polls if the organizations that conduct them get the opportunity to grab a headline by publishing their own estimate of the outcome. That is fine, but I believe the real contribution of each poll should be the dump of its raw numbers into a shared repository that one large model has available to it. Ensembling of estimates could then be more along the lines of different model builders, each with their own methodology and each with access to all the raw data, contributing their own estimates. Maybe this could even be a Kaggle competition!
Maybe this sort of standardized system is still somewhat of a pipe dream. But the growing hunger for accurate numbers and the frustration with polling make me think that we will one day get to a system that looks more like this.