Can you predict congressional elections using Tweets? Part 2

As promised in this post, I am back to predict all of the 2016 congressional elections using the tweets from the month leading up to the election and a model trained on the races from the 2014 midterms.

Refresher: I used a script that queries Twitter's advanced search feature for all the tweets containing each candidate's full name. I saved all such tweets written in the month leading up to the 2014 midterm elections, and I have now done the same for the month leading up to the 2016 elections.
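
The collection script itself is not shown in this post. As a rough sketch only (the endpoint format and date window below are illustrative, not the actual scraper), an exact-phrase advanced-search query can be built like this:

from urllib.parse import quote

def advanced_search_url(full_name, since, until):
    ## hypothetical helper: exact-phrase match on the candidate's full name
    ## within a date window, using Twitter's since:/until: search operators
    query = '"%s" since:%s until:%s' % (full_name, since, until)
    return 'https://twitter.com/search?f=tweets&q=' + quote(query)

advanced_search_url('Sarah Lloyd', '2016-10-08', '2016-11-08')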

Now, I will build the same model that I built in the previous post, except this time I will train on the 2014 races and evaluate on the 2016 races. Let's see how the model performs!

In [1]:
import pandas as pd
import numpy as np
import json
import codecs
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
race_metadata = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata.csv')
race_metadata_2016 = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata-2016.csv')
In [3]:
race_metadata.head()
Out[3]:
Race Result Chamber
0 Alabama 1 [["Bradley Byrne", "Republican", 68.2], ["Burt... House
1 Alabama 2 [["Martha Roby", "Republican", 67.3], ["Erick ... House
2 Alabama 3 [["Mike Rogers", "Republican", 66.1], ["Jesse ... House
3 Alabama 5 [["Mo Brooks", "Republican", 74.4], ["Mark Bra... House
4 Alabama 6 [["Gary Palmer", "Republican", 76.2], ["Mark L... House
In [4]:
race_metadata_2016.head()
Out[4]:
Race Result Chamber
0 Alaska at-large [["Don Young", "Republican", 50.3], ["Steve Li... House
1 Arizona 1 [["Tom O'Halleran", "Democratic", 50.7], ["Pau... House
2 Arizona 7 [["Ruben Gallego", "Democratic", 75.2], ["Eve ... House
3 Arizona 8 [["Trent Franks", "Republican", 68.5], ["Mark ... House
4 Colorado 1 [["Diana DeGette", "Democratic", 67.9], ["Casp... House
In [5]:
## How many races in the train and test sets?
race_metadata.shape[0], race_metadata_2016.shape[0]
Out[5]:
(441, 177)
In [6]:
## add a column that grabs the winner out of the Result field
race_metadata['winner'] = race_metadata.Result.apply(lambda x: json.loads(x)[0][0])
race_metadata_2016['winner'] = race_metadata_2016.Result.apply(lambda x: json.loads(x)[0][0])
In [7]:
race_metadata.head()
Out[7]:
Race Result Chamber winner
0 Alabama 1 [["Bradley Byrne", "Republican", 68.2], ["Burt... House Bradley Byrne
1 Alabama 2 [["Martha Roby", "Republican", 67.3], ["Erick ... House Martha Roby
2 Alabama 3 [["Mike Rogers", "Republican", 66.1], ["Jesse ... House Mike Rogers
3 Alabama 5 [["Mo Brooks", "Republican", 74.4], ["Mark Bra... House Mo Brooks
4 Alabama 6 [["Gary Palmer", "Republican", 76.2], ["Mark L... House Gary Palmer
In [8]:
race_metadata_2016.head()
Out[8]:
Race Result Chamber winner
0 Alaska at-large [["Don Young", "Republican", 50.3], ["Steve Li... House Don Young
1 Arizona 1 [["Tom O'Halleran", "Democratic", 50.7], ["Pau... House Tom O'Halleran
2 Arizona 7 [["Ruben Gallego", "Democratic", 75.2], ["Eve ... House Ruben Gallego
3 Arizona 8 [["Trent Franks", "Republican", 68.5], ["Mark ... House Trent Franks
4 Colorado 1 [["Diana DeGette", "Democratic", 67.9], ["Casp... House Diana DeGette
In [9]:
## how many candidates in each race in the train set?
race_metadata.Result.apply(lambda x: len(json.loads(x))).describe()
Out[9]:
count    441.000000
mean       2.725624
std        1.101446
min        2.000000
25%        2.000000
50%        2.000000
75%        3.000000
max       12.000000
Name: Result, dtype: float64
In [10]:
## how many candidates in each race in the test set?
race_metadata_2016.Result.apply(lambda x: len(json.loads(x))).describe()
Out[10]:
count    177.00000
mean       3.59887
std        1.24435
min        3.00000
25%        3.00000
50%        3.00000
75%        4.00000
max       12.00000
Name: Result, dtype: float64

The code below is the same as in the last post, with a couple of modifications:

  • The 2016 tweets are in a different directory, so I added a year argument to the function that modifies the path accordingly.
  • The 2016 race metadata has some data quality issues that I didn't notice until now: some of the "candidate" names are not actual people but rather '–' or 'Write-Ins'. I am filtering these out since they are not helpful for our goal of predicting who will win a race.
In [11]:
def make_ascii(s):
    return s.encode('ascii','ignore').decode('ascii')

def make_df(race_metadata,year=2014):
    ## builds a DataFrame with one row per candidate from the saved tweet files
    values = []
    path = '/Users/adamwlevin/election-twitter/elections-twitter/data/tweets'
    if year==2016:
        path += '/t2016'
    for row_ind, row in race_metadata.iterrows():
        try:
            filename = '%s/%s.json' % (path,make_ascii(row.Race).replace(' ',''))
            with codecs.open(filename,'r','utf-8-sig') as f:
                tweets = json.load(f)
        except FileNotFoundError:
            print('Did not find %s ' % (row.Race,))
            continue
        for candidate,data in tweets.items():
            ## skip "candidates" that are not actual people
            if candidate in ('–','Blank/Void/Scattering','Write-Ins','Others'):
                continue
            ## one list each for tweets, replies, retweets, favorites
            record = [[] for _ in range(4)]
            for date,data_ in data.items():
                ## skip days where the scraper failed (the sentinel string
                ## below, typo included, is what the scraper saved)
                if data_ and data_!='Made 5 attempts, all unsucessful.':
                    data_ = np.array(data_)
                    for i in range(4):
                        record[i] = np.concatenate(
                            [record[i],
                             data_[:,i].astype(int) if i!=0 else data_[:,i]])
            values.append([candidate]+record+[1 if candidate==row.winner else 0,row_ind])
    return pd.DataFrame(values,columns=['candidate','tweets','replies',
                                        'retweets','favorites',
                                        'winner','race_index'])
In [12]:
## make the train set and test set
df_train = make_df(race_metadata)
df_test = make_df(race_metadata_2016,year=2016)
Did not find Alaska 
Did not find Arkansas 
In [13]:
## take a look at the result
df_train.head()
Out[13]:
candidate tweets replies retweets favorites winner race_index
0 Bradley Byrne [\nThings to do Tuesday:\n-VOTE; remind friend... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 3.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0, ... [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 1 0
1 Burton LeFlore [\nWATCH: http://ln.is/youtu.be/jvTA8  ELECT... [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ... [0.0, 7.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... 0 0
2 Martha Roby [\nGlad to see and hear Martha Roby at Rotary ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0, 0.0,... [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... 1 1
3 Erick Wright [\nBetter Vote Erick Wright, Jennifer S. Marsd... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0,... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... 0 1
4 Mike Rogers [\nNSA Director Mike Rogers will begin his add... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 7.0, 0.0, 2.0, 1.0, 0.0, 2.0, 2.0, ... [0.0, 0.0, 1.0, 5.0, 1.0, 1.0, 0.0, 0.0, 1.0, ... 1 2
In [14]:
df_test.head()
Out[14]:
candidate tweets replies retweets favorites winner race_index
0 Don Young [\nAdded a new video: "Satisfaction Lamajj ft ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.0, 0.0, 0.0,... [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 28.0, 0.0, 0.0,... 1 0
1 Steve Lindbeck [\nSteve Lindbeck is running as a Democrat aga... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ... [0.0, 15.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 4.0,... [0.0, 28.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 2.0,... 0 0
2 Jim C. McDermott [] [] [] [] 0 0
3 Bernie Souphanavong [\nBernie Souphanavong brings up drones, Dosto... [0.0, 0.0, 0.0] [0.0, 0.0, 6.0] [0.0, 0.0, 4.0] 0 0
4 Stephen Wright [\nWho writes her material? I'd buy a CD for h... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 1.0, ... [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 6.0, 1.0, 0.0, 3.0, 1.0, 1.0, ... 0 0
In [15]:
## who has the most tweets of the 2016 candidates?
df_test.loc[df_test.tweets.apply(len).idxmax()]
Out[15]:
candidate                                             Paul Ryan
tweets        [\nThe nightmare for Republicans is, if Trump ...
replies       [153.0, 5.0, 11.0, 2.0, 9.0, 2.0, 2.0, 4.0, 62...
retweets      [944.0, 31.0, 40.0, 40.0, 88.0, 58.0, 18.0, 50...
favorites     [2766.0, 56.0, 16.0, 79.0, 93.0, 62.0, 23.0, 1...
winner                                                        1
race_index                                                  148
Name: 501, dtype: object

Now, I will follow the same model-building procedure as last time. There are three classes of features: metadata about the language within the tweets (e.g. average number of words per tweet), metadata about the tweets themselves (e.g. average number of replies per tweet), and a TF-IDF vectorizer built using the words as tokens, treating all of the tweets about a candidate concatenated together as a single document (e.g. the TF-IDF score of the word "congressman"). The model is an XGBoost classifier.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
In [17]:
## This is useful for selecting a subset of features in the middle of a Pipeline
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, ndim):
        self.keys = keys
        self.ndim = ndim
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, data_dict):
        res = data_dict[self.keys]
        return res

## Making some features about the text itself
class TweetTextMetadata(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, docs):
        ave_words_per_tweet = [sum(len(tweet.split(' '))
                                   for tweet in tweets)\
                               /len(tweets) 
                               if len(tweets) else 0
                               for tweets in docs]
        total_number_words = [sum(len(tweet.split(' '))
                                  for tweet in tweets)
                              for tweets in docs]
        ave_word_len = [sum(len(word) for tweet in tweets 
                            for word in tweet.split(' '))/\
                        sum(1 for tweet in tweets
                            for word in tweet.split(' '))
                        if len(tweets) else 0 for tweets in docs]
        total_periods = [sum(tweet.count('.')
                             for tweet in tweets)
                         for tweets in docs]
        total_q_marks = [sum(tweet.count('?')
                             for tweet in tweets)
                         for tweets in docs]
        ## note: this relies on locals() preserving definition order (CPython
        ## 3.6+) so the columns line up with `names` below
        return np.column_stack([value
                                for key,value in locals().items()
                                if isinstance(value,list)])

    names = ['ave_words_per_tweet','total_number_words','ave_word_len','total_periods','total_q_marks']

## Making some features about the favorites, retweets, etc.
class TweetStats(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, df):
        warnings.filterwarnings("ignore",
                                message="Mean of empty slice.")
        total_replies = df.replies.apply(sum)
        total_retweets = df.retweets.apply(sum)
        total_favorites = df.favorites.apply(sum)
        num_tweets = df.replies.apply(len)
        ave_replies_per_tweet = df.replies.apply(np.mean).fillna(0)
        ave_retweets_per_tweet = df.retweets.apply(np.mean).fillna(0)
        ave_favorites_per_tweet = df.favorites.apply(np.mean).fillna(0)
        ninety_eighth_percentile_replies = df.replies.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ninety_eighth_percentile_retweets = df.retweets.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ninety_eighth_percentile_favorites = df.favorites.apply(lambda x: np.percentile(x,98.) if len(x) else 0.)
        ## same locals() trick as above, this time stacking every pd.Series
        return np.column_stack([value.values for key,value in locals().items() if isinstance(value,pd.Series)])
    
    names = ['total_replies','total_retweets','total_favorites',
             'num_tweets','ave_replies_per_tweet','ave_retweets_per_tweet',
             'ave_favorites_per_tweet','ninety_eighth_percentile_replies',
             'ninety_eighth_percentile_retweets',
             'ninety_eighth_percentile_favorites']

## This inherits from TfidfVectorizer and just cleans the tweets a little before vectorizing them
## (this is probably unnecessary, but I haven't tested it)
class CustomTfidfVectorizer(TfidfVectorizer):
    def cleanse_tweets(self,tweets):
        return ' '.join([word for tweet in tweets
                         for word in tweet.split(' ') 
                         if 'http://' not in word
                         and 'www.' not in word
                         and '@' not in word
                         and 'https://' not in word
                         and '.com' not in word
                         and '.net' not in word])
    
    def fit(self, x, y=None):
        return super().fit(x.apply(self.cleanse_tweets).values)
    
    def transform(self, x):
        return super().transform(x.apply(self.cleanse_tweets).values)

    def fit_transform(self, x, y=None):
        self.fit(x,y)
        return self.transform(x)

## This takes in an XGBClassifier and finds the optimal number of trees using CV
def get_num_trees(clf,X,y,cv,eval_metric='logloss',early_stopping_rounds=10):
    n_trees = []
    for train,test in cv.split(X,y):
        clf.fit(X[train], y[train],
                eval_set=[[X[test],y[test]]],
                eval_metric=eval_metric, 
                early_stopping_rounds=early_stopping_rounds,
                verbose=False)
        n_trees.append(clf.best_iteration)
    print('Number of trees selected: %d' % \
          (int(sum(n_trees)/len(n_trees)),))
    return int(sum(n_trees)/len(n_trees))
In [18]:
## add every candidate name to the stop words so the vectorizer ignores them
names = [name_.lower() for result in race_metadata.Result
         for name,_,_ in json.loads(result) for name_ in name.split()]
stop_words = names + list(ENGLISH_STOP_WORDS)
In [19]:
## I grid-searched some of the hyperparameters below using grouped CV
features = FeatureUnion(
[
('tfidf',Pipeline([
    ('selector',ItemSelector(keys='tweets',ndim=1)),
    ('tfidf',CustomTfidfVectorizer(use_idf=False,
                                   stop_words=stop_words,
                                   ngram_range=(1,1),
                                   min_df=.05))
])),
('tweet_metadata',Pipeline([
    ('selector',ItemSelector(keys='tweets',ndim=1)),
    ('metadata_extractor',TweetTextMetadata())
])),
('tweet_stats',Pipeline([
    ('selector',ItemSelector(keys=['replies','retweets',
                                   'favorites'],
                             ndim=2)),
    ('tweet_stats_extractor',TweetStats())
]))
])

clf = XGBClassifier(learning_rate=.01,n_estimators=100000,
                    subsample=.9,max_depth=2)
In [20]:
## make train matrix, fit model on train set
X = features.fit_transform(df_train[['tweets','replies',
                                     'retweets','favorites']])
y = df_train['winner'].values
cv = StratifiedKFold(n_splits=6,shuffle=True)
n_estimators = get_num_trees(clf,X,y,cv)
clf.n_estimators = n_estimators
clf.fit(X,y)
feature_names = sorted(['WORD_%s' % (word,)
                        for word in features.get_params()['tfidf'].get_params()['tfidf'].vocabulary_.keys()]) +\
                TweetTextMetadata.names +\
                TweetStats.names
Number of trees selected: 476
In [21]:
## print top 10 importances and their names
importances = clf.feature_importances_
importances = {u:val for u,val in enumerate(importances)}
for ind in sorted(importances,key=importances.get,reverse=True)[:10]:
    print(feature_names[ind],importances[ind])
WORD_congressman 0.136006
WORD_evaluation 0.0844667
WORD_113th 0.0622763
WORD_members 0.0450966
WORD_congress 0.0350752
WORD_ad 0.027917
WORD_ads 0.0257695
WORD_rep 0.0250537
WORD_state 0.0229062
WORD_youtube 0.0221904

Taking a second look at these feature importances, it looks like the words chosen as important are proxies for whether the candidate is the incumbent. This makes sense, from the little that I know about politics.
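
As a rough sanity check of that incumbency story (a sketch only; I did not run this as part of the original analysis), one could compare how often "congressman" appears in tweets about eventual winners versus losers:

## fraction of candidates with at least one tweet containing "congressman",
## split by whether they won their race
has_word = df_train.tweets.apply(
    lambda tweets: any('congressman' in tweet.lower() for tweet in tweets))
print(has_word.groupby(df_train.winner).mean())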

Let's look at the training accuracy:

In [22]:
preds = clf.predict_proba(X)[:,1]
## put the raw predictions in the dataframe so we can use df.groupby
df_train['pred_raw'] = preds
In [23]:
df_train.head()
Out[23]:
candidate tweets replies retweets favorites winner race_index pred_raw
0 Bradley Byrne [\nThings to do Tuesday:\n-VOTE; remind friend... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 3.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0, ... [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 1 0 0.979400
1 Burton LeFlore [\nWATCH: http://ln.is/youtu.be/jvTA8  ELECT... [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ... [0.0, 7.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... 0 0 0.055113
2 Martha Roby [\nGlad to see and hear Martha Roby at Rotary ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0, 0.0,... [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... 1 1 0.953310
3 Erick Wright [\nBetter Vote Erick Wright, Jennifer S. Marsd... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0,... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... 0 1 0.043328
4 Mike Rogers [\nNSA Director Mike Rogers will begin his add... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 7.0, 0.0, 2.0, 1.0, 0.0, 2.0, 2.0, ... [0.0, 0.0, 1.0, 5.0, 1.0, 1.0, 0.0, 0.0, 1.0, ... 1 2 0.954916
In [24]:
## get dictionaries mapping race index to index of predicted and true winners
preds = df_train.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true = df_train.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
In [25]:
## get train accuracy at the race level
acc = np.mean([preds[race_ind]==true[race_ind] for race_ind in df_train.race_index.unique()])
acc
Out[25]:
0.95216400911161736

Now let's test the model on the 2016 races to see how it performs. I will produce the features using the same function as earlier, make predictions with the trained model, and then make a plot and compute the accuracy.

In [26]:
## get test matrix and predictions
X_test = features.transform(df_test[['tweets','replies','retweets','favorites']])
preds_test = clf.predict_proba(X_test)[:,1]
In [27]:
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds_test[(df_test.winner==1).values],alpha=.5,
         label='predictions for winners');
plt.hist(preds_test[(df_test.winner==0).values],alpha=.5,
         label='predictions for non-winners');
plt.legend();
plt.title('Test Set Predictions');
In [28]:
## put the raw predictions in the test dataframe so we can use df.groupby
df_test['pred_raw'] = preds_test
In [29]:
## get dictionaries mapping race index to index of predicted and true winners, this time on the test set
preds_test = df_test.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true_test = df_test.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
In [30]:
## get test accuracy at the race level
acc = np.mean([preds_test[race_ind]==true_test[race_ind] for race_ind in df_test.race_index.unique()])
acc
Out[30]:
0.88135593220338981

88% accuracy is not that bad, considering I used nothing but tweets and built in no prior knowledge!
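
For context, a naive baseline (again just a sketch, not something computed in the original run) would be to simply pick the candidate with the most tweets in each race:

## hypothetical baseline: predict the candidate with the most tweets per race
naive_preds = df_test.groupby('race_index').tweets.apply(
    lambda x: x.apply(len).idxmax()).to_dict()
np.mean([naive_preds[r]==true_test[r]
         for r in df_test.race_index.unique()])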

Let's take a quick look at where the model failed. First, here's the highest raw prediction (the probability that the candidate will win) for a non-winner:

In [31]:
df_test[~df_test.winner.astype(bool)].sort_values('pred_raw',ascending=False).head(1)
Out[31]:
candidate tweets replies retweets favorites winner race_index pred_raw
511 Sarah Lloyd [\nours increase when more anxious as well. bu... [2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... [0.0, 0.0, 1.0, 1.0, 4.0, 0.0, 0.0, 5.0, 1.0, ... [1.0, 0.0, 3.0, 0.0, 3.0, 0.0, 0.0, 5.0, 4.0, ... 0 151 0.801264
In [32]:
## take a look at the first 30 tweets for Sarah Lloyd
df_test[~df_test.winner.astype(bool)].sort_values('pred_raw',ascending=False).tweets.iloc[0][0:30]
Out[32]:
array([ '\nours increase when more anxious as well. but also notice everyday stuff after hearing Sarah Lloyd talk.\n',
       '\n2/2 having heard Sarah Lloyd, wondering about under-developed sensory system and using sensory integration.\n',
       '\n@lloyd4wi Sarah Lloyd finds #frogandtoad in #littlefreelibrary.  #flipitdem.pic.twitter.com/Z6j2wtPXU9\n',
       '\n@lloyd4wi Sarah Lloyd meeting voters who want a change. #flipitdem\n@WisDems know what a difference she will makepic.twitter.com/f8Xt7qBYc5\n',
       '\nSarah Lloyd @lloyd4wi out and about in #fdl.  @WisDems working to #flipitdem\n@Khary4Congress\n@Ryan_Solen \n@NelsonforWI\n@MaryHoeftForWIpic.twitter.com/WW49weFkLX\n',
       '\nSarah Lloyd smart, articulate candidate http://fondul.ac/2ezRmr8\xa0 #wiunion #wiright #wipolitics\n',
       '\nWI-06 Sarah LLOYD\n4 CONGRESS@FlipItDem\n@lloyd4wi#BlueWI\nFLIP  HOUSE  BLUE & \nMadame POTUSDem Majority\nDEFEAT Glenn Grothmanpic.twitter.com/9Dw5MUnuhk\n',
       '\nDonald Trump & Glenn Grothman think Rapists have Rights! \nWisconsin, Vote for Sarah Lloyd\n@lloyd4wi @FlipItDem #BlueWave2016 #ImWithHerpic.twitter.com/xeQlZWBZm0\n',
       "\nIt's the final weekend. @WisDems time to get out\n@lloyd4wi Sarah Lloyd\n@Khary4Congress\n@NelsonforWI \nLet's #FlipItDem and make a differencepic.twitter.com/VDB2dO1M41\n",
       '\nDonald Trump & Glenn Grothman want to Punish You! \nFight back Wisconsin, vote for Sarah Lloyd! @lloyd4wi @FlipItDem #ImWithHer #BlueWave2016pic.twitter.com/NYZPhQpOAe\n',
       '\nSettling down to hear Sarah Lloyd #sensoryintegration\n',
       '\nRetweeted Sarah Lloyd-Hughes (@gingernibbles):\n\nThe five types of difficult audience members… and how to handle... http://fb.me/4jM5VlkQx\xa0\n',
       '\nWI 06  SARAH LLOYD\n4CONGRESSMUST WIN\n@lloyd4wi@FlipItDem\nFLIPHOUSEBLUE\nMadame POTUSDem Majority\nDEFEAT Glenn Grothmanpic.twitter.com/5ve6EQYXtG\n',
       '\nEditorial: Sarah Lloyd is ready to represent Wisconsin http://host.madison.com/opinion/editorial/editorial-sarah-lloyd-is-ready-to-represent-wisconsin/article_c794e3bb-2a00-5ba1-b6da-79c51cee6b07.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share\xa0… via @CapTimes #wiunion #wiright #DumpGrothman\n',
       '\nWisconsin we need Sarah Lloyd!  Vote  for her on Tuesday !\n',
       "\nDonald Trump & Glenn Grothman don't want YOU to know this! \nVote for Sarah Lloyd\n@lloyd4wi @FlipItDem #ThursdayThoughts #BlueWave2016pic.twitter.com/KzlaT2diHw\n",
       '\nEditorial: Sarah Lloyd is ready to represent Wisconsin http://host.madison.com/opinion/editorial/editorial-sarah-lloyd-is-ready-to-represent-wisconsin/article_c794e3bb-2a00-5ba1-b6da-79c51cee6b07.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share\xa0… via @CapTimes #wiunion #wipolitics #DumpGrothman\n',
       '\nCandidates4Prog: RT NewWisGov: Sarah Lloyd is ready to represent Wisconsin http://host.madison.com/opinion/editorial/editorial-sarah-lloyd-is-ready-to-represent-wisconsin/article_c794e3bb-2a00-5ba1-b6da-79c51cee6b07.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share\xa0… #wipolitics #WI06 #WiBackHer\n',
       '\nSarah Lloyd = Wisconsins next progressive hero. And btw we need them! https://twitter.com/newwisgov/status/793995406627606528\xa0…\n',
       '\nSarah Lloyd is ready to represent Wisconsin http://host.madison.com/opinion/editorial/editorial-sarah-lloyd-is-ready-to-represent-wisconsin/article_c794e3bb-2a00-5ba1-b6da-79c51cee6b07.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share\xa0… #wipolitics #WI06 #WiBackHer\n',
       '\nEditorial: Sarah Lloyd is ready to represent Wisconsin http://host.madison.com/opinion/editorial/editorial-sarah-lloyd-is-ready-to-represent-wisconsin/article_c794e3bb-2a00-5ba1-b6da-79c51cee6b07.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share\xa0… via @CapTimes #wiunion #wipolitics #DumpGrothman\n',
       '\nRarely do we get superbly qualified candidates. Sarah Lloyd @lloyd4wi is good news. #flipit #voteblue #ourrevolution @Khary4Congress too!https://twitter.com/bluejean_nation/status/793844737471975424\xa0…\n',
       "\nDonald Trump & Glenn Grothman don't want YOU to know this! \nVote for Sarah Lloyd\n@lloyd4wi @FlipItDem #BlueWave2016 #ImWithHerpic.twitter.com/MlwPwasjCD\n",
       '\nRemember you asked for this Sarah Lloyd you no one to blame but yourself. http://fb.me/4hrS7l8Ej\xa0\n',
       '\nJust had a great talk on the phone with congressional candidate for Wisconsin Sarah Lloyd\n',
       '\nAwesome interview with Sarah Lloyd @lloyd4wi http://www.radioplusinfo.com/episode/11-1-democratic-6th-district-congressional-candidate-sarah-lloyd/\xa0…. #ourrevolution #voteblue #flipit\n',
       '\nSarah Lloyd talking about the murder of her son to knife crime...\n#knifecrimedangers\n#besafe\n@WYP_LeedsINEpic.twitter.com/xTVgVG4Cgv\n',
       '\nSarah Lloyd for Congress. http://fb.me/8aUhm7vcL\xa0\n',
       "\nDonald Trump & Glenn Grothman's New & More Evil GOP! \nFight back Wisconsin, vote for Sarah Lloyd! \n@lloyd4wi #BlueWave2016 @FlipItDempic.twitter.com/6vmtd0FAg5\n",
       '\nEditorial: Sarah Lloyd is ready to represent Wisconsin - http://Madison.com\xa0 http://dlvr.it/MZblyh\xa0 #Madison #Wisconsin\n'],
      dtype='<U319')
In [33]:
## the race Lloyd lost
df_test[df_test.race_index==151]
Out[33]:
candidate tweets replies retweets favorites winner race_index pred_raw
510 Glenn Grothman [\nThey're ubiquitous up here. My "brand Glen... [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0, 1.0, 0.0, ... [1.0, 1.0, 5.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0, ... 1 151 0.483090
511 Sarah Lloyd [\nours increase when more anxious as well. bu... [2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... [0.0, 0.0, 1.0, 1.0, 4.0, 0.0, 0.0, 5.0, 1.0, ... [1.0, 0.0, 3.0, 0.0, 3.0, 0.0, 0.0, 5.0, 4.0, ... 0 151 0.801264
512 Jeff Dahlke [\nJeff Dahlke. I'm only voting him bc he boug... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 4.0, 0.0, 0.0, 0.0] [1.0, 1.0, 13.0, 0.0, 0.0, 0.0] 0 151 0.048588
In [34]:
print(race_metadata_2016.loc[151])
print(race_metadata_2016.loc[151].Result)
Race                                             Wisconsin 6
Result     [["Glenn Grothman", "Republican", 57.2], ["Sar...
Chamber                                                House
winner                                        Glenn Grothman
Name: 151, dtype: object
[["Glenn Grothman", "Republican", 57.2], ["Sarah Lloyd", "Democratic", 37.3], ["Jeff Dahlke", "Independent", 5.5]]

So this looks like a race where Democratic enthusiasm (or Twitter activism) was high but the Republican won. It could also have a little to do with the fact that Sarah Lloyd is also the name of a British travel writer, but I am not sure.

Now let's take a look at the lowest raw prediction for a winner:

In [35]:
df_test[df_test.winner.astype(bool)].sort_values('pred_raw').head(1)
Out[35]:
candidate tweets replies retweets favorites winner race_index pred_raw
240 Jason T. Smith [\nEver ask "Why ain't my SH*T selling?" Join ... [0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0] [0.0, 0.0, 4.0, 0.0] 1 69 0.039982
In [36]:
## this race
df_test[df_test.race_index==69]
Out[36]:
candidate tweets replies retweets favorites winner race_index pred_raw
240 Jason T. Smith [\nEver ask "Why ain't my SH*T selling?" Join ... [0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0] [0.0, 0.0, 4.0, 0.0] 1 69 0.039982
241 Dave Cowell [\nJust so happens the person who runs Armatur... [0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 8.0, 0.0, 1.0, 0.0, ... [1.0, 2.0, 1.0, 3.0, 0.0, 21.0, 2.0, 2.0, 1.0,... 0 69 0.051360
242 Jonathan Shell [] [] [] [] 0 69 0.040377
In [37]:
print(race_metadata_2016.loc[69])
print(race_metadata_2016.loc[69].Result)
Race                                              Missouri 8
Result     [["Jason T. Smith", "Republican", 74.4], ["Dav...
Chamber                                                House
winner                                        Jason T. Smith
Name: 69, dtype: object
[["Jason T. Smith", "Republican", 74.4], ["Dave Cowell", "Democratic", 22.7], ["Jonathan Shell", "Libertarian", 2.9]]

This one makes more sense to me: "Jason T. Smith" is what I have as the candidate's name (from Wikipedia). Since Twitter search looks for an exact string match, it makes sense that the model would not have a good read on this candidate; people are unlikely to tweet a name with a middle initial.
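
A toy illustration of the problem:

## exact substring matching misses the name variants people actually type
tweet = 'Congressman Jason Smith is holding a town hall tonight'
for query in ('Jason T. Smith', 'Jason Smith'):
    print('%-15s -> %s' % (query, query in tweet))
## Jason T. Smith  -> False
## Jason Smith     -> True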

Stay tuned: as a next step, I plan to collect the tweets for the month leading up to the 2018 congressional races and post my predictions on election day.

Written on October 28, 2018