Can you predict congressional elections using Tweets? Part 2
As promised in my previous post, I am back to predict all of the 2016 congressional races using the tweets from the month leading up to the election and a model trained on the races from the 2014 midterms.
Refresher: I used a script that queries Twitter's advanced search for all tweets containing each candidate's full name. I saved all such tweets written in the month leading up to the 2014 midterm elections, and have now done the same for the month leading up to the 2016 elections.
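For reference, here is a minimal sketch of the kind of query such a script issues. The function name and URL construction are my own illustration, not the actual script; Twitter's search supports exact-phrase quoting plus since:/until: date operators.
from urllib.parse import quote

def search_url(name, since, until):
    # hypothetical illustration of the scraper's query, not the real script:
    # an exact-phrase match on the candidate's full name, restricted to a
    # date window via Twitter's since:/until: search operators
    query = '"%s" since:%s until:%s' % (name, since, until)
    return 'https://twitter.com/search?q=' + quote(query)

print(search_url('Sarah Lloyd', '2016-10-08', '2016-11-08'))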
Now, I will build the same model that I built in the previous post, except this time I will train on the 2014 races and evaluate on the 2016 races. Let's see how the model performs!
import pandas as pd
import numpy as np
import json
import codecs
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
race_metadata = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata.csv')
race_metadata_2016 = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata-2016.csv')
race_metadata.head()
race_metadata_2016.head()
## How many races in the train and test sets?
race_metadata.shape[0], race_metadata_2016.shape[0]
## add a column that pulls the winner out of each race's Result
race_metadata['winner'] = race_metadata.Result.apply(lambda x: json.loads(x)[0][0])
race_metadata_2016['winner'] = race_metadata_2016.Result.apply(lambda x: json.loads(x)[0][0])
race_metadata.head()
race_metadata_2016.head()
## how many candidates in each race in the train set?
race_metadata.Result.apply(lambda x: len(json.loads(x))).describe()
## how many candidates in each race in the test set?
race_metadata_2016.Result.apply(lambda x: len(json.loads(x))).describe()
The code below is the same as in the last post, with a couple of modifications:
- The 2016 tweets are in a different directory, so I added an argument to the function that adjusts the path for the year.
- The 2016 race metadata has some data quality issues that I didn't notice until now: some of the candidate names are not actual people but placeholders like '–' or 'Write-Ins'. I am filtering these out since they are not helpful for our goal of predicting who will win a race.
def make_ascii(s):
return s.encode('ascii','ignore').decode('ascii')
def make_df(race_metadata,year=2014):
values = []
path = '/Users/adamwlevin/election-twitter/elections-twitter/data/tweets'
if year==2016:
path += '/t2016'
for row_ind, row in race_metadata.iterrows():
try:
with codecs.open('%s/%s.json' % (path,make_ascii(row.Race).replace(' ',''),),'r','utf-8-sig') as f:
tweets = json.load(f)
except FileNotFoundError:
print('Did not find %s ' % (row.Race,))
continue
for candidate,data in tweets.items():
if candidate in ('–','Blank/Void/Scattering','Write-Ins','Others'):
continue
            record = [[] for _ in range(4)]  # tweets, replies, retweets, favorites
            for date, data_ in data.items():
                # skip empty days and the scraper's failure message
                # (note: the 'unsucessful' typo must match the saved string exactly)
                if data_ and data_ != 'Made 5 attempts, all unsucessful.':
                    data_ = np.array(data_)
                    for i in range(4):
                        record[i] = np.concatenate(
                            [record[i],
                             data_[:, i].astype(int) if i != 0 else data_[:, i]])
values.append([candidate]+record+[1 if candidate==row.winner else 0,row_ind])
return pd.DataFrame(values,columns=['candidate','tweets','replies',
'retweets','favorites',
'winner','race_index'])
## make the train set and test set
df_train = make_df(race_metadata)
df_test = make_df(race_metadata_2016,year=2016)
## take a look at the result
df_train.head()
df_test.head()
## who has the most tweets of the 2016 candidates?
df_test.loc[df_test.tweets.apply(len).idxmax()]
Now, I will follow the same model-building procedure as last time. There are three classes of features: metadata about the language within the tweets (e.g., average number of words per tweet), metadata about the tweets themselves (e.g., average number of replies per tweet), and a tfidf vectorizer built using the words as tokens, treating all of the tweets about a candidate concatenated together as a single document (e.g., the tfidf score of the word "congressman"). The model is an XGBoost classifier.
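To make the one-document-per-candidate idea concrete, here is a toy sketch on made-up tweets; this is my own illustration, not the real pipeline below, which also cleans the tweets and uses a custom stop-word list.
from sklearn.feature_extraction.text import TfidfVectorizer

## each candidate's tweets get joined into one string, so every
## candidate becomes a single tfidf document
docs = [' '.join(['great rally tonight', 'vote for the congressman']),
        ' '.join(['debate recap', 'new poll out today'])]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(X.shape)  # (2 candidates, vocabulary size)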
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
## This is useful for selecting a subset of features in the middle of a Pipeline
class ItemSelector(BaseEstimator, TransformerMixin):
def __init__(self, keys, ndim):
self.keys = keys
self.ndim = ndim
def fit(self, x, y=None):
return self
    def transform(self, data_dict):
        # select the stored key(s) from the incoming DataFrame
        return data_dict[self.keys]
## Making some features about the text itself
class TweetTextMetadata(BaseEstimator, TransformerMixin):
def __init__(self):
pass
def fit(self, x, y=None):
return self
def transform(self, docs):
ave_words_per_tweet = [sum(len(tweet.split(' '))
for tweet in tweets)\
/len(tweets)
if len(tweets) else 0
for tweets in docs]
total_number_words = [sum(len(tweet.split(' '))
for tweet in tweets)
for tweets in docs]
ave_word_len = [sum(len(word) for tweet in tweets
for word in tweet.split(' '))/\
sum(1 for tweet in tweets
for word in tweet.split(' '))
if len(tweets) else 0 for tweets in docs]
total_periods = [sum(tweet.count('.')
for tweet in tweets)
for tweets in docs]
total_q_marks = [sum(tweet.count('?')
for tweet in tweets)
for tweets in docs]
        # stack every list defined above into a feature matrix;
        # relies on locals() preserving definition order
        return np.column_stack([value
                                for key, value in locals().items()
                                if isinstance(value, list)])
names = ['ave_words_per_tweet','total_number_words','ave_word_len','total_periods','total_q_marks']
## Making some features about the favorites, retweets, etc.
class TweetStats(BaseEstimator, TransformerMixin):
def __init__(self):
pass
def fit(self, x, y=None):
return self
def transform(self, df):
warnings.filterwarnings("ignore",
message="Mean of empty slice.")
total_replies = df.replies.apply(sum)
total_retweets = df.retweets.apply(sum)
total_favorites = df.favorites.apply(sum)
num_tweets = df.replies.apply(len)
ave_replies_per_tweet = df.replies.apply(np.mean).fillna(0)
ave_retweets_per_tweet = df.retweets.apply(np.mean).fillna(0)
ave_favorites_per_tweet = df.favorites.apply(np.mean).fillna(0)
ninety_eighth_percentile_replies = df.replies.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
ninety_eighth_percentile_retweets = df.retweets.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
ninety_eighth_percentile_favorites = df.favorites.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        # stack every Series defined above into a feature matrix;
        # relies on locals() preserving definition order
        return np.column_stack([value.values for key, value in locals().items()
                                if isinstance(value, pd.Series)])
names = ['total_replies','total_retweets','total_favorites',
'num_tweets','ave_replies_per_tweet','ave_retweets_per_tweet',
'ave_favorites_per_tweet','ninety_eighth_percentile_replies',
'ninety_eighth_percentile_retweets',
'ninety_eighth_percentile_favorites']
## This subclasses TfidfVectorizer and just cleans the tweets a little (dropping URLs and @-mentions) before vectorizing them
## (this cleaning is probably unnecessary but I haven't tested)
class CustomTfidfVectorizer(TfidfVectorizer):
def cleanse_tweets(self,tweets):
return ' '.join([word for tweet in tweets
for word in tweet.split(' ')
if 'http://' not in word
and 'www.' not in word
and '@' not in word
and 'https://' not in word
and '.com' not in word
and '.net' not in word])
def fit(self, x, y=None):
return super().fit(x.apply(self.cleanse_tweets).values)
def transform(self, x):
return super().transform(x.apply(self.cleanse_tweets).values)
def fit_transform(self, x, y=None):
self.fit(x,y)
return self.transform(x)
## This takes in an XGBClassifier and finds the optimal number of trees using CV
def get_num_trees(clf,X,y,cv,eval_metric='logloss',early_stopping_rounds=10):
n_trees = []
for train,test in cv.split(X,y):
clf.fit(X[train], y[train],
eval_set=[[X[test],y[test]]],
eval_metric=eval_metric,
early_stopping_rounds=early_stopping_rounds,
verbose=False)
n_trees.append(clf.best_iteration)
print('Number of trees selected: %d' % \
(int(sum(n_trees)/len(n_trees)),))
return int(sum(n_trees)/len(n_trees))
## build the stop-word list from all candidate names plus standard English stop words
names = [name_.lower() for result in race_metadata.Result
         for name, _, _ in json.loads(result) for name_ in name.split()]
stop_words = names + list(ENGLISH_STOP_WORDS)
## I grid-searched some of the hyperparameters below using grouped CV
features = FeatureUnion(
[
('tfidf',Pipeline([
('selector',ItemSelector(keys='tweets',ndim=1)),
('tfidf',CustomTfidfVectorizer(use_idf=False,
stop_words=stop_words,
ngram_range=(1,1),
min_df=.05))
])),
('tweet_metadata',Pipeline([
('selector',ItemSelector(keys='tweets',ndim=1)),
('metadata_extractor',TweetTextMetadata())
])),
('tweet_stats',Pipeline([
('selector',ItemSelector(keys=['replies','retweets',
'favorites'],
ndim=2)),
('tweet_stats_extractor',TweetStats())
]))
])
clf = XGBClassifier(learning_rate=.01,n_estimators=100000,
subsample=.9,max_depth=2)
## make train matrix, fit model on train set
X = features.fit_transform(df_train[['tweets','replies',
'retweets','favorites']])
y = df_train['winner'].values
cv = StratifiedKFold(n_splits=6,shuffle=True)
n_estimators = get_num_trees(clf,X,y,cv)
clf.n_estimators = n_estimators
clf.fit(X,y)
feature_names = sorted(['WORD_%s' % (word,)
for word in features.get_params()['tfidf'].get_params()['tfidf'].vocabulary_.keys()]) +\
TweetTextMetadata.names +\
TweetStats.names
## print top 10 importances and their names
importances = clf.feature_importances_
importances = {u:val for u,val in enumerate(importances)}
for ind in sorted(importances,key=importances.get,reverse=True)[:10]:
print(feature_names[ind],importances[ind])
Looking at these feature importances a second time, it looks like the words chosen as important are proxies for whether the candidate is the incumbent. This makes sense, from the little that I know about politics.
Let's look at the training accuracy:
preds = clf.predict_proba(X)[:,1]
## put the raw predictions in the dataframe so we can use df.groupby
df_train['pred_raw'] = preds
df_train.head()
## get dictionaries mapping race index to the row index of the predicted and true winners
preds = df_train.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true = df_train.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
## get train accuracy at the race level
acc = np.mean([preds[race_ind]==true[race_ind] for race_ind in df_train.race_index.unique()])
acc
Now let's test the model on the 2016 races to see how it performs. I will produce the features using the same feature pipeline as before, make predictions with the trained model, and then make a plot and compute the accuracy.
## get test matrix and predictions
X_test = features.transform(df_test[['tweets','replies','retweets','favorites']])
preds_test = clf.predict_proba(X_test)[:,1]
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds_test[(df_test.winner==1).values],alpha=.5,
label='predictions for winners');
plt.hist(preds_test[(df_test.winner==0).values],alpha=.5,
label='predictions for non-winners');
plt.legend();
plt.title('Test Set Predictions');
## put the raw predictions in the test dataframe so we can use df.groupby
df_test['pred_raw'] = preds_test
## get dictionaries mapping race index to the row index of the predicted and true winners, this time on the test set
preds_test = df_test.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true_test = df_test.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
## get test accuracy on race level
acc = np.mean([preds_test[race_ind]==true_test[race_ind] for race_ind in df_test.race_index.unique()])
acc
88% accuracy is not bad, considering I used nothing but tweets and built in no prior knowledge.
Let's take a quick look at where the model failed. First, here's the highest raw prediction (the probability that the candidate will win) for a non-winner:
df_test[~df_test.winner.astype(bool)].sort_values('pred_raw',ascending=False).head(1)
## take a look at the first 30 tweets for Sarah Lloyd
df_test[~df_test.winner.astype(bool)].sort_values('pred_raw',ascending=False).tweets.iloc[0][0:30]
## the race Lloyd lost
df_test[df_test.race_index==151]
print(race_metadata_2016.loc[151])
print(race_metadata_2016.loc[151].Result)
So this looks like a race where Democratic enthusiasm (or Twitter activism) was high but the Republican won. It could also have a little to do with the fact that Sarah Lloyd is also the name of a British travel writer, but I am not sure.
Now let's take a look at the lowest raw prediction for a winner:
df_test[df_test.winner.astype(bool)].sort_values('pred_raw').head(1)
## this race
df_test[df_test.race_index==69]
print(race_metadata_2016.loc[69])
print(race_metadata_2016.loc[69].Result)
This one makes more sense to me: "Jason T. Smith" is what I have as the candidate's name (from Wikipedia). Since the Twitter search looks for an exact string match, it makes sense that the model would not have a good read on this candidate, as people are unlikely to tweet a name with the middle initial included.
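One possible mitigation, which I have not implemented, would be to also query looser variants of each name. The helper below is a hypothetical sketch of that idea:
def name_variants(name):
    # hypothetical helper, not part of the original scraper: generate
    # looser search phrases so an exact-match search on 'Jason T. Smith'
    # would also catch tweets saying 'Jason Smith'
    parts = name.split()
    variants = {name}
    if len(parts) > 2:
        variants.add('%s %s' % (parts[0], parts[-1]))  # drop middle name/initial
    return sorted(variants)

print(name_variants('Jason T. Smith'))  # ['Jason Smith', 'Jason T. Smith']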
Stay tuned: as a next step, I plan to collect the tweets for the month leading up to the 2018 congressional races and post my predictions on election day.