Predicting A Movie's Opening Weekend Success Using Tweets
Can we create a model to predict the opening weekend box office success of a movie using tweets? Theoretically, it seems like it should be possible. A lot of people tweeting about a movie should indicate that it will be successful. Practically, is it possible to abstract meaning from data that is seemingly so noisy?
It seems daunting considering that with a simple strategy there will be plenty of instances where tweets are considered that are not even about the movie in question. For example, when you search for the movie 'Wild (2014)', you get some results like this:
no mum these are not hickeys i was attacked by a wild racoon get off my back
— Jaaack (@jackofhammond) December 12, 2014
I was undaunted by this challenge and dove straight in. In the end, without doing anything too sophisticated, I was able to produce a fairly decent model.
Obtaining the Box Office Data¶
I obtained data from boxofficemojo.com using the requests library and parsed it using BeautifulSoup4. I specifically worked with this data which is all Domestic (U.S.) movies which have revenue data for a Friday-Saturday-Sunday opening weekends. I grabbed the first 1000 from this list sorted by date with recent first. I made this decision based on the fact that Twitter's popularity has grown extreemly rapidly. I made the judgment that by around late 2009 there were been enough tweets to predict the success of a movie.
Obtaining the Twitter Data¶
I used Selenium to crawl Twitter's Advanced Search Engine and get the data. Using the API was out of the questions since the API only has tweets from the last 7 days. Scraping was a challenge because of the 'infinite scroll' feature of this site.
I used the 'This exact phrase' box for searching. Also, I did a little processing to the movies' titles. To avoid penalizing movies with long titles I stripped all years in parenthesis and subtitles. So for "Precious: Based on the Novel 'Push' by Sapphire" I searched for 'Precious'.
Because of time and resourse contraints (still working on my Linux skills for AWS) I constrained my data to tweets from the week before the opening of the movie.
If I made one search for each movie with the date range set to the whole week, I typically obtained a never ending stream of tweets from the first couple days before the opening. To try to capture the difference in volume of tweets across movies, I did 7 seperate searches - a day is as granular of a window you can search for - and told Selenium to scroll down the page at most 25 times for each search.
I ended up collecting data for 826 of the 1000 movies that I pulled from boxofficemojo.
A little cleaning¶
I want to focus on the modeling decisions with this post so I will skip to the point in which I aggregated all the data into one nice looking csv file.
import pandas as pd
pd.set_option('display.max_columns',None)
data = pd.read_csv('./../data/TwitterMovies/all_the_data.csv')
data.shape
Most recent data point:
data.head(1)
Least recent data point:
data.tail(1)
What the hell are all those columns you might ask. Well opening_wknd is the opening weekend gross which has not been adjusted for inflation. Retweets is currently a string that has the number of Retweets for each tweet that had any Retweets joined with a dash. Likes is the same except for Likes. And Text is a massive document of text from all the tweets that were obtained about the particular movie.
Here's a sample of text:
data.iloc[0].Text[0:150]
Number of characters in this particular document:
len(data.iloc[0].Text)
Alright, now, let's get the boring stuff out of the way. Let's get the Retweets and Likes into a more usable form.
import re
pattern = '[0-9]'
regex = re.compile(pattern)
data.Retweets = [[0]*(num_tweets-len([c for c in str(ret) if re.match(regex,c)])) + map(int,[c for c in str(ret) if re.match(regex,c)]) for ret,num_tweets in zip(data.Retweets,data.Num_Tweets)]
data.Likes = [[0]*(num_tweets-len([c for c in str(lik) if re.match(regex,c)])) + map(int,[c for c in str(lik) if re.match(regex,c)]) for lik,num_tweets in zip(data.Likes,data.Num_Tweets)]
data.head(1)
Now let's purge the Text data of all links and Twitter handles.
def cleanse_doc(doc):
    return ' '.join([word for word in doc.split() 
                     if 'http://' not in word and 'www.' not in word
                     and '@' not in word and 'https://' not in word
                     and '.com' not in word and '.net' not in word])
data.Text = [cleanse_doc(doc) for doc in data.Text]
data.iloc[0].Text[0:150]
One more boring but necessary step: adjust the revenue numbers for inflation. Since I have less than 7 years of data, I make the simplifying assumption that there has been a constant rate of inflation of this period (12% in total).
from datetime import date
def inflation_adjuster(dates,gross):
    inf_rate = .12/7.0
    dates = [date(*map(int,d.split('-'))) for d in dates]
    base = min(dates)
    today = max(dates)
    return [g * (1+inf_rate)**((today-base).days/365.0) / \
     (1+inf_rate)**((d-base).days/365.0) for d,g in zip(dates,gross)]
data.opening_wknd = inflation_adjuster(data.opening_date,data.opening_wknd)
data.tail(1)
Feature Engineering¶
Since the volume of Twitter data has increased so much during the last 7 years, including the time as a feature and looking at interactions will be essential.
data.insert(1,'days from today',[(date(2016,7,15) - date(*map(int,d.split('-')))).days for d in data.opening_date])
data.head(1)
Now, I don't quite know what summary statistics of the Retweets and Likes I should include. So, I'll just include a whole bunch and allow my model to do this work for me.
import numpy as np
def trimmed_mean(t):
    t = np.array(t)
    iqr = np.percentile(t,75) - np.percentile(t,25)
    med = np.median(t)
    t = t[np.where((t>=(med-1.5*iqr)) & (t<=(med+1.5*iqr)))]
    return np.mean(t)
data.insert(5,'Total_Retweets',[sum(ret) for ret in data.Retweets])
data.insert(6,'Median_Retweets',[np.median(ret) for ret in data.Retweets])
data.insert(7,'Mean_Retweets',data.Total_Retweets/data.Num_Tweets)
data.insert(8,'Trimmed_Mean_Retweets',[trimmed_mean(ret) for ret in data.Retweets])
data.insert(9,'90th_Percentile_Retweets',[np.percentile(ret,90) for ret in data.Retweets])
data.insert(10,'95th_Percentile_Retweets',[np.percentile(ret,95) for ret in data.Retweets])
data.insert(11,'98th_Percentile_Retweets',[np.percentile(ret,98) for ret in data.Retweets])
data.insert(12,'99th_Percentile_Retweets',[np.percentile(ret,99) for ret in data.Retweets])
data.insert(13,'99.5th_Percentile_Retweets',[np.percentile(ret,99.5) for ret in data.Retweets])
data.insert(14,'Max_Retweets',[max(ret) for ret in data.Retweets])
data.insert(16,'Total_Likes',[sum(lik) for lik in data.Likes])
data.insert(17,'Median_Likes',[np.median(lik) for lik in data.Likes])
data.insert(18,'Mean_Likes',data.Total_Likes/data.Num_Tweets)
data.insert(19,'Trimmed_Mean_Likes',[trimmed_mean(lik) for lik in data.Likes])
data.insert(20,'90th_Percentile_Likes',[np.percentile(lik,90) for lik in data.Likes])
data.insert(21,'95th_Percentile_Likes',[np.percentile(lik,95) for lik in data.Likes])
data.insert(22,'98th_Percentile_Likes',[np.percentile(lik,98) for lik in data.Likes])
data.insert(23,'99th_Percentile_Likes',[np.percentile(lik,99) for lik in data.Likes])
data.insert(24,'99.5th_Percentile_Likes',[np.percentile(lik,99.5) for lik in data.Likes])
data.insert(25,'Max_Likes',[max(lik) for lik in data.Likes])
data.head(1)
Now I'll use sklearn's TfidfVectorizer to convert my documents into text frequencies. I will use n-grams of lengths one, two, and three. In order to filter out some noise, I'm going to specify that for a feature to be made, the word or phrase must have been in at least 20% of the documents.
Also, from previous iterations, I learned that it is a good idea to tokenize the titles of the movies and include these tokens as stop words. The reason being that words that are in the titles of movies are considered to be some of the most important features in the Tf-idf world and including these words as features will make it more difficult for a model to pick out the truly generalizable word features.
If a token from the titles appears in at least 10% of the titles, I will allow it to be a feature. If not, it will be a stop word.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
titles = [title.lower() for title in data.title]
tfidf_titles = TfidfVectorizer(use_idf=True,stop_words=None,ngram_range=(1,3),max_df=.10)
tfidf_titles.fit_transform(titles);
stop_words = tfidf_titles.get_feature_names() + titles + list(ENGLISH_STOP_WORDS)
tfidf = TfidfVectorizer(use_idf=True,stop_words=stop_words,ngram_range=(1,3),min_df=.20)
X_text = tfidf.fit_transform(data.Text).toarray()
X_text.shape
I'll save the names of the text features for later.
d = tfidf.get_feature_names()
Modeling with just the Summary Statistics¶
The first step will be to split the data into a train and test set. I'll use a 70%, 30% train, test split.
np.random.seed(200)
train_inds = np.random.choice(range(len(data)),int(.7*len(data)),replace=False)
test_inds = [i for i in range(len(data)) if i not in train_inds]
y_train, y_test = data.opening_wknd[train_inds], data.opening_wknd[test_inds]
print 'Length of Training Set: %d.' % len(y_train),'Length of Test Set: %d.' % len(y_test)
Since I have a suspicion that some of my features might be redundant, I'll use the Lasso to encourage sparcity.
from sklearn.linear_model import LassoCV
lcv = LassoCV(n_alphas=100, fit_intercept=True, normalize=True, max_iter=1e7, cv=50, n_jobs=-1, random_state=3)
summary_stat_feature_inds = [1] + range(4,15) + range(16,26)
X_summary_stat = data.iloc[:,summary_stat_feature_inds]
X_summary_stat.head(3)
X_summary_stat.shape
lcv.fit(X_summary_stat.iloc[train_inds,:],y_train);
Let's take a look at how many of the 22 features have non-zero coefficients.
np.sum(lcv.coef_!=0)
Now how did the model perform? Let's look at the model's R-squared using the mean MSE across the 50 folds for the optimal choice of alpha.
1 - lcv.mse_path_.mean(1).min()/np.var(y_train)
Now let's visualize how this model did.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.figure(figsize=(10,5))
plt.plot(lcv.alphas_,lcv.mse_path_.mean(1),lw=5,c='purple',label='Mean of 50 Folds');
plt.axhline(np.var(y_train),label='MSE of Prediction with Mean',lw=2,ls='--',c='r');
plt.legend();
plt.ylim(0,12e14);
plt.ylabel('Mean Squared Error');
plt.xlabel('Lambda');
This model performed slightly better than using the mean as a predictor. Since we know that at least the interaction between volume of tweets and likes and retweets with the time feature should be predictive, let's try adding linear interaction terms.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_summary_stat = poly.fit_transform(X_summary_stat)
X_summary_stat.shape
lcv.fit(X_summary_stat[train_inds],y_train);
How many of the 253 features have non-zero coefficents?
sum(lcv.coef_!=0)
Let's see if the R-squared is any better.
1 - lcv.mse_path_.mean(1).min()/np.var(y_train)
It is. And the graph..
plt.figure(figsize=(10,5))
plt.plot(lcv.alphas_,lcv.mse_path_.mean(1),lw=5,c='purple',label='Mean of 50 Folds');
plt.axhline(np.var(y_train),label='MSE of Prediction with Mean',lw=2,ls='--',c='r');
plt.legend();
plt.ylim(0,12e14);
plt.ylabel('Mean Squared Error');
plt.xlabel('Lambda');
Modeling with just the Text Features¶
lcv.fit(X_text[train_inds],y_train);
How many non-zero features?
sum(lcv.coef_!=0)
R-Squared?
1 - lcv.mse_path_.mean(1).min()/np.var(y_train)
What are the most predictive features?
for ind in np.argsort(np.abs(lcv.coef_))[::-1][0:10]:
    print d[ind],lcv.coef_[ind]
Putting the Two Sets of Features Together¶
I do not want to take only the pre-selected features from the previous models and then run a model with just those features since that is one of the classic over-fitting pitfalls. Instead I will just run a single lasso with both sets of features.
In order to implement this strategy while avoiding overfitting, feature-selection would need to happen inside each fold of cross-validation. This seems beyond the scope of a simple post so I will not do this here.
X = np.hstack((X_summary_stat,X_text))
lcv.fit(X[train_inds],y_train);
Number of non-zero features?
sum(lcv.coef_!=0)
R-Squared?
1 - lcv.mse_path_.mean(1).min()/np.var(y_train)
Interestingly, the r-squared went down a tiny bit when I included the summary stat features.
Testing¶
I will end by simply predicting the hold-out movies, computing the R-Squared of the prediction and visualizing the predictions.
lcv.score(X[test_inds],y_test)
Not bad!
Here is a scatter plot of truth vs. predictions:
predictions = lcv.predict(X[test_inds])
fig,ax = plt.subplots(1,1,figsize=(8,8))
plt.scatter(y_test/1e8,predictions/1e8);
plt.plot([0,2.0],[0,2.0])
plt.xlabel('True Opening Weekend Gross',fontsize=20);
plt.ylabel('Predicted Opening Weekend Gross',fontsize=20);
plt.xlim(-.2,2);
plt.ylim(-.2,2);
plt.title('True vs. Predicted (Hundreds of Millions of Dollars)',fontsize=16);
ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14);
The model seems to be struggling most with the largest grossing movies. This makes some sense since the distribution is very heavy-tailed.
Another way of quantifying the results, correlation between test predictions and test outcomes:
np.corrcoef(y_test,predictions)[0,1]
Conclusion¶
I believe I have demonstrated that Twitter can be a rich data source for predicting movies' success. Making this model better could be done by collecting more data - i.e. going further back than 7 days before the opening, and scrolling down more than 25 times for each search. I wonder if what was holding back the summary statistic features from being predictive was the small amount of data that was collected.
As for the text features, a simple improvement might be to add as stop words any words that are too specific to a few movies but are not in the titles of the movies - i.e. 'Bond'. A more complicated improvement might be to try to decipher whether a tweet is about the movie in question on a tweet by tweet basis.
More generally, I think I have shown that data from Twitter can definitely have predictive value.
Bonus material¶
Just for fun, let's fit a model to the whole data set and see the 20 text features with the largest positive coefficients, and the 20 with the largest negative coefficients.
lcv.fit(X_text,data.opening_wknd);
Top 20 positive features:
print ', '.join([d[ind] for ind in np.argsort(lcv.coef_)[::-1][0:20]])
Top 20 negative features:
print ', '.join([d[ind] for ind in np.argsort(lcv.coef_)[0:20]])