Can you predict congressional elections using Tweets?
From time to time, I enjoy collecting tweets and trying to predict things with them. At Metis, I collected tweets to predict the opening weekend gross box office sales of new movies (blog post here). This time, I collected tweets about political candidates (specifically ones with their full name in the tweet) in the month leading up to the 2014 US Congressional elections, with the intent of predicting the outcome of each election.
I collected the tweets using Python, BeautifulSoup, Selenium and Chromedriver (script here). I collected tweets from all House and Senate races in 2014 (except a few that got lost in the shuffle). Lucky for you, with the data collection headache out of the way, we get to skip right to the fun part - the analysis and model building!
import pandas as pd
import numpy as np
import json
import codecs
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
race_metadata = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata.csv')
## here is the race metadata
race_metadata.head(3)
## each row has this info
race_metadata.loc[0].Result
That's JSON data listing each candidate, their party affiliation, and the percent of the vote they received.
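To make that layout concrete, here is a minimal sketch of parsing one Result value. The candidate names, parties, and percentages are invented for illustration, and storing the percent as a number is an assumption; the only thing taken from the data is the shape, a list of [name, party, percent] entries with the winner listed first.

```python
import json

# Hypothetical Result value: names, parties, and percentages are invented.
result = '[["Jane Doe", "D", 55.2], ["John Roe", "R", 42.1], ["Pat Poe", "I", 2.7]]'

candidates = json.loads(result)

# Assumes the first entry is the winner, as the winner column in this post does.
winner = candidates[0][0]
n_candidates = len(candidates)
names = [name for name, _party, _pct in candidates]

print(winner, n_candidates, names)
```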
## put a column in that grabs the winner out
race_metadata['winner'] = race_metadata.Result.apply(lambda x: json.loads(x)[0][0])
## split into train and test sets
race_metadata_train, race_metadata_test = train_test_split(race_metadata,test_size=.3,random_state=44)
How many races are we dealing with, you might ask?
race_metadata_train.shape, race_metadata_test.shape
So we will train on 308 races worth of tweets, and test on 133.
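As a quick sanity check on those counts, the arithmetic can be reproduced by hand. This is a sketch that assumes scikit-learn rounds the test share up and gives the remainder to the train set; the total of 441 races is simply the sum of the two counts above.

```python
import math

n_races = 308 + 133  # total races implied by the two counts above
test_size = 0.3

# Assumption: the test share is rounded up and the rest goes to train.
n_test = math.ceil(test_size * n_races)
n_train = n_races - n_test

print(n_train, n_test)
```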
The first step will be to read in the tweets and get them into a nice tidy dataframe that we can easily compute statistics on and train models. We need to think about how to set up the prediction problem. Obviously, this is classification, since we are trying to predict choices (candidates). But, how many classes are there?
## how many candidates in each race?
race_metadata_train.Result.apply(lambda x: len(json.loads(x))).describe()
There are a variable number of classes - not something I've frequently encountered and not very straightforward to deal with. I thought the best way to deal with this is to "map" this problem down to a binary classification problem, in which each candidate has his or her own row which has the independent variables and the dependent variable - a 0 for if they lost their race, and a 1 if they won their race.
Mapping back up to predict the outcome of each race can be done in a few ways. First, I am not trying to predict the percentage of the vote each candidate will receive. That would be a different task. I am strictly looking at this from a classification point of view. With that in mind, the end goal could be to #1 pick the winning candidate for each race or #2 produce a probability that each candidate will win each race. For #1, we can simply use the output of the binary classification model by taking the maximum probability predicted within each race. For #2, we would have to do some more legwork. We can't simply take the output of the binary model and translate that into the probability that each candidate will win. For example, let's say the binary model spits out the following:
- Candidate 1 has a 55% chance of being a winner
- Candidate 2 has a 44% chance of being a winner
- Candidate 3 has a 33% chance of being a winner
There is no way of translating the above into what we'd want - the chance that Candidate 1 wins the race against Candidate 2 and Candidate 3 - since the above probabilities are not even in the correct event space.
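A few lines of arithmetic make the problem visible. This is only an illustration using the three example numbers above: the scores add up to more than 1, so they cannot be read as a distribution over who wins the race, although their ranking is still usable for option #1.

```python
# The three binary-model outputs from the example above.
p_win = {"Candidate 1": 0.55, "Candidate 2": 0.44, "Candidate 3": 0.33}

# The scores sum to more than 1, so they are not a distribution over the race outcome.
total = sum(p_win.values())
print(round(total, 2))

# Option #1 only needs the ranking, which is still well defined.
predicted_winner = max(p_win, key=p_win.get)
print(predicted_winner)
```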
If we wanted to get #2, the probability each candidate will win each race, I think we would have to build separate models for each number of classes: one model for races with 2 candidates, another for races with 3 candidates, and so on.
I am going to go with #1, picking the winning candidate for each race, since I don't think the result from #2 will be worth the legwork.
For the binary classification problem, we would like a row for each candidate with his or her features, a boolean indicating whether or not they won their race, and a race identifier so that later we can group by this identifier and take the maximum within the group to get our predicted winner.
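The "group by race, take the maximum" step can be sketched on a toy table. Everything here is invented for illustration (the rows, the scores, and the column name pred_raw); it only shows the mechanics of picking one predicted winner per race and scoring that pick against the true winner.

```python
import pandas as pd

# Toy per-candidate table; race_index, scores, and outcomes are invented.
toy = pd.DataFrame({
    "race_index": [0, 0, 1, 1, 1],
    "winner":     [1, 0, 0, 1, 0],
    "pred_raw":   [0.70, 0.40, 0.55, 0.44, 0.33],
})

# Row label of the highest-scoring candidate and of the true winner in each race.
pred_rows = toy.groupby("race_index")["pred_raw"].idxmax()
true_rows = toy.groupby("race_index")["winner"].idxmax()

# Fraction of races where the top-scored candidate is the actual winner.
race_accuracy = (pred_rows == true_rows).mean()
print(pred_rows.to_dict(), true_rows.to_dict(), race_accuracy)
```

In this toy case the pick is right for race 0 and wrong for race 1, so the race-level accuracy is one half.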
def make_ascii(s):
    return s.encode('ascii','ignore').decode('ascii')
def make_df(race_metadata):
    values = []
    for row_ind, row in race_metadata.iterrows():
        ## one JSON file of tweets per race, skip races whose file is missing
        try:
            with codecs.open('/Users/adamwlevin/election-twitter/elections-twitter/data/tweets/%s.json' % (make_ascii(row.Race).replace(' ',''),),'r','utf-8-sig') as f:
                tweets = json.load(f)
        except FileNotFoundError:
            print('Did not find %s ' % (row.Race,))
            continue
        for candidate,data in tweets.items():
            ## record holds four vectors: tweets, replies, retweets, favorites
            record = [[]]*4
            for date,data_ in data.items():
                if data_:
                    data_ = np.array(data_)
                    for i in range(4):
                        record[i] = \
                            np.concatenate([record[i],data_[:,i].astype(int) if i!=0 else data_[:,i]])
            ## one row per candidate: features, won/lost flag, race identifier
            values.append(record+[1 if candidate==row.winner else 0,row_ind])
    return pd.DataFrame(values,columns=['tweets','replies',
                                        'retweets','favorites',
                                        'winner','race_index'])
df_train = make_df(race_metadata_train)
A few races - looks like senate races - seem to have gotten lost in the shuffle.
## take a look at the result
df_train.head()
## each row has vectors for the features:
df_train.iloc[0]
## a tweet
df_train.iloc[0]['tweets'][0]
## total number of tweets in the train set
df_train.tweets.apply(len).sum()
## which candidate had the most tweets about him/her
df_train.iloc[np.where(df_train.tweets.apply(len)==df_train.tweets.apply(len).max())[0]]
race_metadata[race_metadata.Result.str.contains('James Brown')]
This shows that searching Twitter by name is not an exact science. This James Brown is the politician, but it seems like most people tweeting "James Brown" are talking about the artist.
## now for some ML stuff
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
## This is useful for selecting a subset of features in the middle of a Pipeline
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, ndim):
        self.keys = keys
        self.ndim = ndim
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        res = data_dict[self.keys]
        return res
## Making some features about the text itself
class TweetTextMetadata(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, x, y=None):
        return self
    def transform(self, docs):
        ave_words_per_tweet = [sum(len(tweet.split(' '))
                                   for tweet in tweets)\
                               /len(tweets)
                               if len(tweets) else 0
                               for tweets in docs]
        total_number_words = [sum(len(tweet.split(' '))
                                  for tweet in tweets)
                              for tweets in docs]
        ave_word_len = [sum(len(word) for tweet in tweets
                            for word in tweet.split(' '))/\
                        sum(1 for tweet in tweets
                            for word in tweet.split(' '))
                        if len(tweets) else 0 for tweets in docs]
        total_periods = [sum(tweet.count('.')
                             for tweet in tweets)
                         for tweets in docs]
        total_q_marks = [sum(tweet.count('?')
                             for tweet in tweets)
                         for tweets in docs]
        ## stack explicitly (rather than via locals()) so the column order always matches `names`
        return np.column_stack([ave_words_per_tweet,total_number_words,
                                ave_word_len,total_periods,total_q_marks])
    names = ['ave_words_per_tweet','total_number_words','ave_word_len','total_periods','total_q_marks']
## Making some features about the favorites, retweets, etc.
class TweetStats(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, x, y=None):
        return self
    def transform(self, df):
        ## np.mean warns on candidates with no tweets, those are filled with 0 below
        warnings.filterwarnings("ignore",
                                message="Mean of empty slice.")
        total_replies = df.replies.apply(sum)
        total_retweets = df.retweets.apply(sum)
        total_favorites = df.favorites.apply(sum)
        num_tweets = df.replies.apply(len)
        ave_replies_per_tweet = df.replies.apply(np.mean).fillna(0)
        ave_retweets_per_tweet = df.retweets.apply(np.mean).fillna(0)
        ave_favorites_per_tweet = df.favorites.apply(np.mean).fillna(0)
        ninety_eighth_percentile_replies = df.replies.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ninety_eighth_percentile_retweets = df.retweets.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ninety_eighth_percentile_favorites = df.favorites.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ## stack explicitly (rather than via locals()) so the column order always matches `names`
        columns = [total_replies,total_retweets,total_favorites,
                   num_tweets,ave_replies_per_tweet,ave_retweets_per_tweet,
                   ave_favorites_per_tweet,ninety_eighth_percentile_replies,
                   ninety_eighth_percentile_retweets,
                   ninety_eighth_percentile_favorites]
        return np.column_stack([value.values for value in columns])
    names = ['total_replies','total_retweets','total_favorites',
             'num_tweets','ave_replies_per_tweet','ave_retweets_per_tweet',
             'ave_favorites_per_tweet','ninety_eighth_percentile_replies',
             'ninety_eighth_percentile_retweets',
             'ninety_eighth_percentile_favorites']
## This inherits a TfidfVectorizer and just cleans the tweets a little before vectorizing them
## (this is probably unnecessary but haven't tested)
class CustomTfidfVectorizer(TfidfVectorizer):
    def cleanse_tweets(self,tweets):
        return ' '.join([word for tweet in tweets
                         for word in tweet.split(' ')
                         if 'http://' not in word
                         and 'www.' not in word
                         and '@' not in word
                         and 'https://' not in word
                         and '.com' not in word
                         and '.net' not in word])
    def fit(self, x, y=None):
        return super().fit(x.apply(self.cleanse_tweets).values)
    def transform(self, x):
        return super().transform(x.apply(self.cleanse_tweets).values)
    def fit_transform(self, x, y=None):
        self.fit(x,y)
        return self.transform(x)
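To see what the cleaning step does to a candidate's tweets, here is a standalone sketch of the same filter applied to two invented tweets. The function mirrors the logic of cleanse_tweets but is rewritten as a free function for illustration; the sample tweets and the `markers` tuple are not from the original data.

```python
def cleanse_tweets(tweets):
    """Join all tweets into one string, dropping tokens that look like links or mentions."""
    markers = ("http://", "https://", "www.", "@", ".com", ".net")
    return " ".join(
        word
        for tweet in tweets
        for word in tweet.split(" ")
        if not any(marker in word for marker in markers)
    )

# Invented tweets for illustration.
sample = ["Vote for @jane http://t.co/x today", "Great rally www.site.org tonight"]
cleaned = cleanse_tweets(sample)
print(cleaned)
```

Note that all of a candidate's tweets end up as a single document, which is what the vectorizer then sees as one row.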
## This takes in a XGBClassifier and finds the optimal number of trees using CV
## (note: recent xgboost releases expect eval_metric and early_stopping_rounds
## on the XGBClassifier constructor rather than in fit())
def get_num_trees(clf,X,y,cv,eval_metric='logloss',early_stopping_rounds=10):
    n_trees = []
    for train,test in cv.split(X,y):
        clf.fit(X[train], y[train],
                eval_set=[[X[test],y[test]]],
                eval_metric=eval_metric,
                early_stopping_rounds=early_stopping_rounds,
                verbose=False)
        n_trees.append(clf.best_iteration)
    ## average the best iteration over the folds
    print('Number of trees selected: %d' % \
          (int(sum(n_trees)/len(n_trees)),))
    return int(sum(n_trees)/len(n_trees))
I am putting the names of the candidates in as stop words since it doesn't make sense to use the name of any particular candidate to predict a generic candidate's fortunes. This is probably already handled by setting min_df on the TfidfVectorizer to a reasonable value, but I'll do it anyway just to be safe.
names = [name_.lower() for result in race_metadata.Result
for name,_,_ in json.loads(result) for name_ in name.split()]
stop_words = names + list(ENGLISH_STOP_WORDS)
## I did grid search some of the below hyperparameters using grouped CV
features = FeatureUnion(
[
('tfidf',Pipeline([
('selector',ItemSelector(keys='tweets',ndim=1)),
('tfidf',CustomTfidfVectorizer(use_idf=False,
stop_words=stop_words,
ngram_range=(1,1),
min_df=.05))
])),
('tweet_metadata',Pipeline([
('selector',ItemSelector(keys='tweets',ndim=1)),
('metadata_extractor',TweetTextMetadata())
])),
('tweet_stats',Pipeline([
('selector',ItemSelector(keys=['replies','retweets',
'favorites'],
ndim=2)),
('tweet_stats_extractor',TweetStats())
]))
])
clf = XGBClassifier(learning_rate=.01,n_estimators=100000,
subsample=.9,max_depth=2)
X = features.fit_transform(df_train[['tweets','replies',
'retweets','favorites']])
y = df_train['winner'].values
cv = StratifiedKFold(n_splits=6,shuffle=True)
n_estimators = get_num_trees(clf,X,y,cv)
clf.n_estimators = n_estimators
clf.fit(X,y)
feature_names = sorted(['WORD_%s' % (word,)
for word in features.get_params()['tfidf'].get_params()['tfidf'].vocabulary_.keys()]) +\
TweetTextMetadata.names +\
TweetStats.names
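The grid search mentioned in the comment above is not shown in this post. As a rough sketch of what grouped CV could look like, the snippet below uses scikit-learn's GroupKFold with the race as the group, so that candidates from the same race never land on both sides of a split. The toy arrays and the choice of GroupKFold are assumptions for illustration, not necessarily what was used for the actual search.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Invented data: 12 candidate rows spread over 6 races (two candidates each).
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([1, 0] * 6)
race_ids = np.repeat(np.arange(6), 2)

cv_grouped = GroupKFold(n_splits=3)
shared = []
for train_idx, test_idx in cv_grouped.split(X_toy, y_toy, groups=race_ids):
    # Races that appear on both sides of the split (there should be none).
    shared.append(set(race_ids[train_idx]) & set(race_ids[test_idx]))

print(shared)
```

Keeping a race intact within a fold matters here because the final metric is computed per race, not per candidate.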
Let's do some model introspection by looking at the predictions on the train set alongside the outcomes, and at the most influential features.
## predict train set
preds = clf.predict_proba(X)[:,1]
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds[(df_train.winner==1).values],alpha=.5,
label='predictions for winners');
plt.hist(preds[(df_train.winner==0).values],alpha=.5,
label='predictions for non-winners');
plt.legend();
plt.title('Train Set Predictions');
## print top 10 importances and their names
importances = clf.feature_importances_
importances = {u:val for u,val in enumerate(importances)}
for ind in sorted(importances,key=importances.get,reverse=True)[:10]:
    print(feature_names[ind],importances[ind])
So, the model seems to fit the train set pretty well and the top features (according to XGBoost's feature importance) are all from the Tfidf.
Now, let's group by the race index and take the candidate with the maximum prediction in each group to get the predicted winner of each race in the train set. This way we can get the accuracy of predicting the winner of each race.
## put the raw predictions in the dataframe so we can use df.groupby
df_train['pred_raw'] = preds
## get dictionaries mapping race index to index of predicted and true winners
preds = df_train.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true = df_train.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
## get train accuracy on race
acc = np.mean([preds[race_ind]==true[race_ind] for race_ind in df_train.race_index.unique()])
acc
Now let's test the model on the holdout set to see how it performs. I will first make the test dataframe, then produce the features, make the predictions and then make another plot and compute the accuracy.
## get test df
df_test = make_df(race_metadata_test)
## get test matrix and predictions
X_test = features.transform(df_test[['tweets','replies','retweets','favorites']])
preds_test = clf.predict_proba(X_test)[:,1]
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds_test[(df_test.winner==1).values],alpha=.5,
label='predictions for winners');
plt.hist(preds_test[(df_test.winner==0).values],alpha=.5,
label='predictions for non-winners');
plt.legend();
plt.title('Test Set Predictions');
## put the raw predictions in the test dataframe so we can use df.groupby
df_test['pred_raw'] = preds_test
## get dictionaries mapping race index to index of predicted and true winners, this time on test set
preds_test = df_test.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true_test = df_test.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
## get test accuracy on race
acc = np.mean([preds_test[race_ind]==true_test[race_ind] for race_ind in df_test.race_index.unique()])
acc
As a final step, let's take a look at the race in the test set with the most candidates and see what the model did.
df_test[df_test.race_index==df_test.groupby('race_index').size().idxmax()]
The model predicted the top candidate as the winner and the top candidate actually won!
Stay tuned: as a next step, I plan to collect the tweets for the month leading up to the 2016 congressional races and see how this model performs on tweets from 2 years later.