Can you predict congressional elections using Tweets?

From time to time, I enjoy collecting tweets and trying to predict things with them. At Metis, I collected tweets to predict the opening weekend gross box office sales of new movies (blog post here). This time, I collected tweets about political candidates (specifically, tweets containing a candidate's full name) in the month leading up to the 2014 US Congressional elections, with the intent of predicting the outcome of each election.

I collected the tweets using Python, BeautifulSoup, Selenium and Chromedriver (script here). I collected tweets from all House and Senate races in 2014 (except a few that got lost in the shuffle). Lucky for you, with the data collection headache out of the way, we get to skip right to the fun part - the analysis and model building!

In [1]:
import pandas as pd
import numpy as np
import json
import codecs
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
In [2]:
race_metadata = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata.csv')
In [3]:
## here is the race metadata
race_metadata.head(3)
Out[3]:
Race Result Chamber
0 Alabama 1 [["Bradley Byrne", "Republican", 68.2], ["Burt... House
1 Alabama 2 [["Martha Roby", "Republican", 67.3], ["Erick ... House
2 Alabama 3 [["Mike Rogers", "Republican", 66.1], ["Jesse ... House
In [4]:
## each row has this info
race_metadata.loc[0].Result
Out[4]:
'[["Bradley Byrne", "Republican", 68.2], ["Burton LeFlore", "Democratic", 31.7]]'

That's JSON data listing each candidate, their party affiliation, and the percentage of the vote they received.
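As a quick illustration of that structure, the string can be decoded with `json.loads` and unpacked into name, party, and percentage. The variable names below are only for illustration; the sample string is the one shown in the output above.

```python
import json

# The Result string for Alabama 1, copied from the output above.
result = '[["Bradley Byrne", "Republican", 68.2], ["Burton LeFlore", "Democratic", 31.7]]'

# Each entry is a [name, party, percent_of_vote] list.
candidates = json.loads(result)
for name, party, pct in candidates:
    print(name, party, pct)

# The notebook treats the first entry as the winner. In this sample the
# first entry is also the one with the highest vote share.
first_listed = candidates[0][0]
top_by_share = max(candidates, key=lambda c: c[2])[0]
print(first_listed, top_by_share)
```

Comparing the first-listed name with the highest vote share is a cheap sanity check on the assumption that the winner always comes first.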

In [5]:
## put a column in that grabs the winner out
race_metadata['winner'] = race_metadata.Result.apply(lambda x: json.loads(x)[0][0])
In [6]:
## split into train and test sets
race_metadata_train, race_metadata_test  = train_test_split(race_metadata,test_size=.3,random_state=44)

How many races are we dealing with, you might ask?

In [7]:
race_metadata_train.shape, race_metadata_test.shape
Out[7]:
((308, 4), (133, 4))

So we will train on 308 races worth of tweets, and test on 133.

The first step will be to read in the tweets and get them into a nice tidy dataframe that we can easily compute statistics on and train models. We need to think about how to set up the prediction problem. Obviously, this is classification, since we are trying to predict choices (candidates). But, how many classes are there?

In [8]:
## how many candidates in each race?
race_metadata_train.Result.apply(lambda x: len(json.loads(x))).describe()
Out[8]:
count    308.000000
mean       2.724026
std        1.166221
min        2.000000
25%        2.000000
50%        2.000000
75%        3.000000
max       12.000000
Name: Result, dtype: float64

There is a variable number of classes, which is not something I've frequently encountered and is not very straightforward to deal with. I thought the best way to handle this was to "map" the problem down to a binary classification problem, in which each candidate gets his or her own row containing the independent variables and the dependent variable: 0 if they lost their race and 1 if they won.
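A minimal sketch of that mapping, in plain Python and on made-up data: the candidate names, race indices, and the helper `to_binary_rows` are hypothetical, and the sketch assumes (as the notebook does) that the winner is listed first in each result.

```python
import json

# Hypothetical races in the same format as race_metadata.Result.
races = {
    10: '[["Candidate A", "Republican", 60.0], ["Candidate B", "Democratic", 40.0]]',
    11: '[["Candidate C", "Democratic", 48.0], ["Candidate D", "Republican", 45.0], '
        '["Candidate E", "Independent", 7.0]]',
}

def to_binary_rows(races):
    """Return one (race_index, name, winner_flag) row per candidate."""
    rows = []
    for race_index, result in races.items():
        candidates = json.loads(result)
        winner = candidates[0][0]  # assumption: the winner is listed first
        for name, _party, _pct in candidates:
            rows.append((race_index, name, 1 if name == winner else 0))
    return rows

rows = to_binary_rows(races)
for row in rows:
    print(row)
```

However many candidates a race has, it contributes exactly one row labelled 1, so a single binary model can be trained across all races.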

Mapping back up to predict the outcome of each race can be done in a few ways. First, I am not trying to predict the percentage of the vote each candidate will receive; that would be a different task. I am strictly looking at this from a classification point of view. With that in mind, the end goal could be to #1 pick the winning candidate for each race or #2 produce a probability that each candidate will win each race. For #1, we can simply use the output of the binary classification model and take the candidate with the maximum predicted probability in each race. For #2, we would have to do some more legwork, because we can't simply take the output of the binary model and translate it into the probability that each candidate will win. For example, let's say the binary model spits out the following:

  • Candidate 1 has a 55% chance of being a winner
  • Candidate 2 has a 44% chance of being a winner
  • Candidate 3 has a 33% chance of being a winner

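There is no way to translate the above into what we'd want, namely the chance that Candidate 1 wins the race against Candidate 2 and Candidate 3, because the probabilities above are not even defined on the correct event space: each one answers "is this candidate a winner?" on its own, so within a race they need not sum to 1.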
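The arithmetic behind that example can be checked in a couple of lines. This is only a sketch using the three numbers from the list above; the dictionary name is made up.

```python
# The three binary-model outputs from the example above.
p_winner = {"Candidate 1": 0.55, "Candidate 2": 0.44, "Candidate 3": 0.33}

# They sum to more than 1, so they cannot be read as per-race win probabilities.
total = sum(p_winner.values())
print(round(total, 2))

# Picking a winner (goal #1) only needs the ordering, which is well defined.
predicted_winner = max(p_winner, key=p_winner.get)
print(predicted_winner)
```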

If we wanted to get #2, the probability that each candidate will win each race, I think we would have to build separate models for each number of classes: one model for races with 2 candidates, another for races with 3 candidates, and so on.

I am going to go with #1, picking the winning candidate for each race, since I don't think the result from #2 will be worth the legwork.

For the binary classification problem, we would like a row for each candidate with his or her features, a boolean indicating whether or not they won their race, and a race identifier so that later we can group by this identifier and take the maximum within the group to get our predicted winner.
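To make the "group by race, take the maximum" step concrete, here is a small plain-Python sketch. The rows, candidate labels, and probabilities are invented for illustration and are not taken from the model.

```python
from collections import defaultdict

# Hypothetical rows: (race_index, candidate, predicted_probability, winner_flag)
rows = [
    (400, "A", 0.91, 1),
    (400, "B", 0.20, 0),
    (400, "C", 0.05, 0),
    (263, "D", 0.40, 1),
    (263, "E", 0.65, 0),
]

# Group candidate rows by race.
by_race = defaultdict(list)
for race_index, name, prob, won in rows:
    by_race[race_index].append((name, prob, won))

# Within each race, the predicted winner is the candidate with the highest
# score, and the true winner is the one flagged 1.
predicted = {r: max(c, key=lambda t: t[1])[0] for r, c in by_race.items()}
actual = {r: max(c, key=lambda t: t[2])[0] for r, c in by_race.items()}

# Race-level accuracy: share of races where the two agree.
accuracy = sum(predicted[r] == actual[r] for r in by_race) / len(by_race)
print(predicted, actual, accuracy)
```

Accuracy here is counted per race rather than per candidate row, which is the quantity we actually care about.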

In [9]:
def make_ascii(s):
    return s.encode('ascii','ignore').decode('ascii')

def make_df(race_metadata):
    values = []
    for row_ind, row in race_metadata.iterrows():
        try:
            with codecs.open('/Users/adamwlevin/election-twitter/elections-twitter/data/tweets/%s.json' % (make_ascii(row.Race).replace(' ',''),),'r','utf-8-sig') as f:
                tweets = json.load(f)
        except FileNotFoundError:
            print('Did not find %s ' % (row.Race,))
            continue
        for candidate,data in tweets.items():
            record = [[]]*4
            for date,data_ in data.items():
                if data_:
                    data_ = np.array(data_)
                    for i in range(4):
                        record[i] = \
                        np.concatenate([record[i],data_[:,i].astype(int) if i!=0 else data_[:,i]])
            values.append(record+[1 if candidate==row.winner else 0,row_ind])
    return pd.DataFrame(values,columns=['tweets','replies',
                                        'retweets','favorites',
                                        'winner','race_index'])
In [10]:
df_train = make_df(race_metadata_train)
Did not find Arkansas 

A few races (they look like Senate races) seem to have gotten lost in the shuffle.

In [11]:
## take a look at the result
df_train.head()
Out[11]:
tweets replies retweets favorites winner race_index
0 [\nLove to see Paul Ryan bring his new pal Gle... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... [0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 3.0, 7.0, 0.0, ... [0.0, 0.0, 2.0, 0.0, 1.0, 1.0, 0.0, 3.0, 0.0, ... 1 400
1 [\n6th District Candidate for Congress Mark Ha... [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, ... [2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0, 0.0, ... 0 400
2 [\nGus Fahrendorf: Not a fan ObamaCare. It's i... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ... 0 400
3 [\nMarshall Adame is an excellent candidate fo... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1 263
4 [\nELECT Democratic Nominee MARSHALL ADAME @Ad... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 0 263
In [12]:
## each row has vectors for the features:
df_train.iloc[0]
Out[12]:
tweets        [\nLove to see Paul Ryan bring his new pal Gle...
replies       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
retweets      [0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 3.0, 7.0, 0.0, ...
favorites     [0.0, 0.0, 2.0, 0.0, 1.0, 1.0, 0.0, 3.0, 0.0, ...
winner                                                        1
race_index                                                  400
Name: 0, dtype: object
In [13]:
## a tweet
df_train.iloc[0]['tweets'][0]
Out[13]:
'\nLove to see Paul Ryan bring his new pal Glenn Grothman on his campaign stops courting black voters http://www.bloomberg.com/politics/features/2014-11-03/what-would-a-gop-candidate-say-if-he-absolutely-positively-couldnt-lose\xa0…\n'
In [14]:
## total number of tweets in the train set
df_train.tweets.apply(len).sum()
Out[14]:
203713
In [15]:
## which candidate had the most tweets about him/her
df_train.iloc[np.where(df_train.tweets.apply(len)==df_train.tweets.apply(len).max())[0]]
Out[15]:
tweets replies retweets favorites winner race_index
65 [\nJames Brown "Sex Machine" Rome on April 24... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0, 0.0, ... 0 81
In [16]:
race_metadata[race_metadata.Result.str.contains('James Brown')]
Out[16]:
Race Result Chamber winner
81 Connecticut 3 [["Rosa DeLauro", "Democratic", 67.1], ["James... House Rosa DeLauro

This shows that searching Twitter by name is not an exact science. This James Brown is the politician, but it seems that most people tweeting "James Brown" are talking about the musician.

In [17]:
## now for some ML stuff
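## ENGLISH_STOP_WORDS is imported from sklearn.feature_extraction.text;
## the old sklearn.feature_extraction.stop_words module is gone in newer scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS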
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
In [18]:
## This is useful for selecting a subset of features in the middle of a Pipeline
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, ndim):
        self.keys = keys
        self.ndim = ndim
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, data_dict):
        res = data_dict[self.keys]
        return res

## Making some features about the text itself
class TweetTextMetadata(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, docs):
        ave_words_per_tweet = [sum(len(tweet.split(' '))
                                   for tweet in tweets)\
                               /len(tweets) 
                               if len(tweets) else 0
                               for tweets in docs]
        total_number_words = [sum(len(tweet.split(' '))
                                  for tweet in tweets)
                              for tweets in docs]
        ave_word_len = [sum(len(word) for tweet in tweets 
                            for word in tweet.split(' '))/\
                        sum(1 for tweet in tweets
                            for word in tweet.split(' '))
                        if len(tweets) else 0 for tweets in docs]
        total_periods = [sum(tweet.count('.')
                             for tweet in tweets)
                         for tweets in docs]
        total_q_marks = [sum(tweet.count('?')
                             for tweet in tweets)
                         for tweets in docs]
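        # Stack explicitly, in the same order as `names` below, rather than
        # relying on the ordering of locals().
        return np.column_stack([ave_words_per_tweet, total_number_words,
                                ave_word_len, total_periods, total_q_marks])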

    names = ['ave_words_per_tweet','total_number_words','ave_word_len','total_periods','total_q_marks']

## Making some features about the favorites, retweets, etc.
class TweetStats(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, df):
        warnings.filterwarnings("ignore",
                                message="Mean of empty slice.")
        total_replies = df.replies.apply(sum)
        total_retweets = df.retweets.apply(sum)
        total_favorites = df.favorites.apply(sum)
        num_tweets = df.replies.apply(len)
        ave_replies_per_tweet = df.replies.apply(np.mean).fillna(0)
        ave_retweets_per_tweet = df.retweets.apply(np.mean).fillna(0)
        ave_favorites_per_tweet = df.favorites.apply(np.mean).fillna(0)
        ninety_eighth_percentile_replies = df.replies.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ninety_eighth_percentile_retweets = df.retweets.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ninety_eighth_percentile_favorites = df.favorites.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
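        # Stack explicitly, in the same order as `names` below, rather than
        # relying on the ordering of locals().
        return np.column_stack([s.values for s in (
            total_replies, total_retweets, total_favorites, num_tweets,
            ave_replies_per_tweet, ave_retweets_per_tweet,
            ave_favorites_per_tweet, ninety_eighth_percentile_replies,
            ninety_eighth_percentile_retweets,
            ninety_eighth_percentile_favorites)])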
    
    names = ['total_replies','total_retweets','total_favorites',
             'num_tweets','ave_replies_per_tweet','ave_retweets_per_tweet',
             'ave_favorites_per_tweet','ninety_eighth_percentile_replies',
             'ninety_eighth_percentile_retweets',
             'ninety_eighth_percentile_favorites']

## This inherits a TfidfVectorizer and just cleans the tweets a little before vectorizing them
## (this is probably unnecessary but haven't tested)
class CustomTfidfVectorizer(TfidfVectorizer):
    def cleanse_tweets(self,tweets):
        return ' '.join([word for tweet in tweets
                         for word in tweet.split(' ') 
                         if 'http://' not in word
                         and 'www.' not in word
                         and '@' not in word
                         and 'https://' not in word
                         and '.com' not in word
                         and '.net' not in word])
    
    def fit(self, x, y=None):
        return super().fit(x.apply(self.cleanse_tweets).values)
    
    def transform(self, x):
        return super().transform(x.apply(self.cleanse_tweets).values)

    def fit_transform(self, x, y=None):
        self.fit(x,y)
        return self.transform(x)

## This takes in a XGBClassifier and finds the optimal number of trees using CV
def get_num_trees(clf,X,y,cv,eval_metric='logloss',early_stopping_rounds=10):
    n_trees = []
    for train,test in cv.split(X,y):
        clf.fit(X[train], y[train],
                eval_set=[[X[test],y[test]]],
                eval_metric=eval_metric, 
                early_stopping_rounds=early_stopping_rounds,
                verbose=False)
        n_trees.append(clf.best_iteration)
    print('Number of trees selected: %d' % \
          (int(sum(n_trees)/len(n_trees)),))
    return int(sum(n_trees)/len(n_trees))

I am putting the names of the candidates in as stop words, since it doesn't make sense to use the name of any particular candidate to predict a generic candidate's fortunes. This is probably already handled by setting min_df on the TfidfVectorizer to a reasonable value, but I'll do it anyway just to be safe.

In [19]:
names = [name_.lower() for result in race_metadata.Result
         for name,_,_ in json.loads(result) for name_ in name.split()]
stop_words = names + list(ENGLISH_STOP_WORDS)
In [20]:
## I did grid search some of the below hyperparameters using grouped CV
features = FeatureUnion(
[
('tfidf',Pipeline([
    ('selector',ItemSelector(keys='tweets',ndim=1)),
    ('tfidf',CustomTfidfVectorizer(use_idf=False,
                                   stop_words=stop_words,
                                   ngram_range=(1,1),
                                   min_df=.05))
])),
('tweet_metadata',Pipeline([
    ('selector',ItemSelector(keys='tweets',ndim=1)),
    ('metadata_extractor',TweetTextMetadata())
])),
('tweet_stats',Pipeline([
    ('selector',ItemSelector(keys=['replies','retweets',
                                   'favorites'],
                             ndim=2)),
    ('tweet_stats_extractor',TweetStats())
]))
])

clf = XGBClassifier(learning_rate=.01,n_estimators=100000,
                    subsample=.9,max_depth=2)
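
The comment at the top of the cell above mentions grouped CV. One way to read that, sketched here in plain Python as an assumption rather than the original grid-search code, is that all candidates from one race are kept in the same fold, so a race never straddles train and validation. The helper name `grouped_folds` and the sample `race_index` list are hypothetical; scikit-learn's `GroupKFold` provides similar behaviour.

```python
def grouped_folds(groups, n_splits):
    """Yield (train_idx, test_idx) lists with each group kept in one fold.

    Groups are assigned to folds round-robin in order of first appearance.
    This is a simplification and does not balance fold sizes.
    """
    unique = list(dict.fromkeys(groups))  # unique groups, order preserved
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test

# Hypothetical race_index column: three races with 2, 3 and 2 candidates.
race_index = [400, 400, 263, 263, 263, 81, 81]
folds = list(grouped_folds(race_index, n_splits=3))
for train, test in folds:
    print(train, test)
```

Keeping a race intact matters because the candidates in one race are not independent: if one row is the winner, the others must be losers.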
In [21]:
X = features.fit_transform(df_train[['tweets','replies',
                                     'retweets','favorites']])
y = df_train['winner'].values
cv = StratifiedKFold(n_splits=6,shuffle=True)
n_estimators = get_num_trees(clf,X,y,cv)
clf.n_estimators = n_estimators
clf.fit(X,y)
feature_names = sorted(['WORD_%s' % (word,)
                        for word in features.get_params()['tfidf'].get_params()['tfidf'].vocabulary_.keys()]) +\
                TweetTextMetadata.names +\
                TweetStats.names
Number of trees selected: 445

Let's do some model introspection by looking at the predictions on the train set along with the outcomes, and at the most influential features.

In [22]:
## predict train set
preds = clf.predict_proba(X)[:,1]
In [23]:
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds[(df_train.winner==1).values],alpha=.5,
         label='predictions for winners');
plt.hist(preds[(df_train.winner==0).values],alpha=.5,
         label='predictions for non-winners');
plt.legend();
plt.title('Train Set Predictions');
In [24]:
## print top 10 importances and their names
importances = clf.feature_importances_
importances = {u:val for u,val in enumerate(importances)}
for ind in sorted(importances,key=importances.get,reverse=True)[:10]:
    print(feature_names[ind],importances[ind])
WORD_congressman 0.13242
WORD_evaluation 0.100457
WORD_members 0.0456621
WORD_congress 0.04414
WORD_old 0.04414
WORD_113th 0.0410959
WORD_article 0.0365297
WORD_congresswoman 0.0258752
WORD_sen 0.0235921
WORD_ads 0.0228311

So, the model seems to fit the train set pretty well, and the top features (according to XGBoost's feature importance) all come from the TF-IDF vectorizer.

Now, let's group by the race index and take the candidate with the maximum prediction in each group to get the predicted winner of each race in the train set. This way we can compute the accuracy of predicting the winner of each race.

In [25]:
## put the raw predictions in the dataframe so we can use df.groupby
df_train['pred_raw'] = preds
In [26]:
## get dictionaries mapping race index to index of predicted and true winners
preds = df_train.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true = df_train.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
In [27]:
## get train accuracy on race
acc = np.mean([preds[race_ind]==true[race_ind] for race_ind in df_train.race_index.unique()])
acc
Out[27]:
0.95114006514657978

Now let's test the model on the holdout set to see how it performs. I will first make the test dataframe, then produce the features, make the predictions and then make another plot and compute the accuracy.

In [28]:
## get test df
df_test = make_df(race_metadata_test)
Did not find Alaska 
In [29]:
## get test matrix and predictions
X_test = features.transform(df_test[['tweets','replies','retweets','favorites']])
preds_test = clf.predict_proba(X_test)[:,1]
In [30]:
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds_test[(df_test.winner==1).values],alpha=.5,
         label='predictions for winners');
plt.hist(preds_test[(df_test.winner==0).values],alpha=.5,
         label='predictions for non-winners');
plt.legend();
plt.title('Test Set Predictions');
In [31]:
## put the raw predictions in the test dataframe so we can use df.groupby
df_test['pred_raw'] = preds_test
In [32]:
## get dictionaries mapping race index to index of predicted and true winners, this time on test set
preds_test = df_test.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true_test = df_test.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
In [33]:
## get test accuracy on race
acc = np.mean([preds_test[race_ind]==true_test[race_ind] for race_ind in df_test.race_index.unique()])
acc
Out[33]:
0.93181818181818177

As a final step, let's take a look at the race in the test set with the most candidates and see what the model did.

In [34]:
df_test[df_test.race_index==df_test.groupby('race_index').size().idxmax()]
Out[34]:
tweets replies retweets favorites winner race_index pred_raw
299 [\nOpponents have made this a money race, $2.7... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 17.0, 2.0, 1.0, 0.0, 0.0,... [0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, ... 1 223 0.935458
300 [\nHughes, LoBiondo clash in their only debate... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, ... 0 223 0.203180
301 [] [] [] [] 0 223 0.049491
302 [] [] [] [] 0 223 0.049491
303 [] [] [] [] 0 223 0.049491
304 [\nGary Stein: Anyone Who Thinks Amendment 2 ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0 223 0.055986

The model gave the highest probability (about 0.94) to the candidate in the top row, and that candidate actually won!

Stay tuned: as a next step, I plan to collect the tweets for the month leading up to the 2016 congressional races and see how this model performs on tweets from 2 years later.

Written on April 23, 2018