Do NBA Game 7s Have Lower Shooting Percentages?

Recently I watched Game 7 between the Sixers and the Raptors, and I heard the announcers talk about how shooting accuracy in Game 7s is notoriously poor. I suspect that when sports announcers make these sorts of remarks they most likely haven't tested their hypothesis with data. As a sports fan, I appreciate these remarks and give them the benefit of the doubt since they often come from former players. As a data analyst, I also enjoy them since they often make for testable hypotheses, and sports data is great because it's often publicly available. In this case, Basketball Reference has the box scores of every game of every NBA playoff series of all time, neatly organized and accessible.

So I decided to gather the data and test the hypothesis that shooting in Game 7s - or final games of series more generally - is appreciably worse than in earlier games of series. Specifically, I am interested in comparing the mean shooting percentage in these final games to the mean shooting percentage in the earlier games.

I will first show the code that I used to assemble the data, and then walk through the analysis and the results.

In [1]:
## imports

import requests
from scipy.stats import gaussian_kde
from collections import defaultdict
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import sleep
import warnings
warnings.simplefilter("ignore")
In [2]:
## collect the data

# base_url = 'https://www.basketball-reference.com'

# def sum_from_record(s):
#     return sum(map(int,s.split('-')))

# def max_from_record(s):
#     return max(map(int,s.split('-')))

# def get(url):
#     max_tries = 5
#     tries = 0
#     while tries<max_tries:
#         tries += 1
#         try:
#             r = requests.get(url, verify=False, timeout=10)
#             return r
#         except requests.RequestException:
#             continue
#     raise RuntimeError('failed to fetch %s after %d tries' % (url,max_tries))
        

# rows = []
# for year in range(1950,2020):
#     print(year)
#     r = get('%s/playoffs/NBA_%d.html' % (base_url,year,))
#     assert r.status_code==200
#     soup = BeautifulSoup(r.content,'html.parser')
#     serieses = soup.find_all('a',text='Series Stats')
#     for series in serieses:
#         r = get('%s/%s' % (base_url,series['href']))
#         assert r.status_code==200
#         soup = BeautifulSoup(r.content,'html.parser')
#         header = soup.find('h2')
#         num_games = sum_from_record(header.text.split('vs.')[1].split('(')[1].split(')')[0])
#         best_of_series = max_from_record(header.text.split('vs.')[1].split('(')[1].split(')')[0])
#         games = soup.find_all('a',text='Final')
#         assert len(games)==num_games
#         for i,game in enumerate(games):
#             sleep(1.9)
#             r = get('%s/%s' % (base_url,game['href']))
#             assert r.status_code==200
#             soup = BeautifulSoup(r.content,'html.parser')
#             bs_tables = soup.find_all('table')
#             if 'Advanced Box Score' in str(bs_tables[1]):
#                 bs_tables = [bs_tables[0],bs_tables[2]]
#             else:
#                 bs_tables = bs_tables[:2]
            
#             pd_tables = list(map(lambda x: pd.read_html(str(x))[0],bs_tables))
#             assert len(pd_tables)==len(bs_tables)
#             team1, team2 = list(map(lambda x: x.find('caption').text.split('(')[0].split(' Table')[0],bs_tables))
#             sum1, sum2 = list(map(lambda x: sum_from_record(x.find('caption').text.split('(')[1].split(')')[0])
#                                   if '(' in x.find('caption').text else 1,
#                                   bs_tables))
#             assert sum1==sum2==i+1
            
#             table1, table2 = pd_tables
            
#             made1, attempted1 = table1.iloc[-1,2], table1.iloc[-1,3]
#             if not isinstance(made1,float) and not isinstance(attempted1,float):
#                 rows.append([year,team1,team2,best_of_series,i+1,num_games,int(made1),int(attempted1)])
            
#             made2, attempted2 = table2.iloc[-1,2], table2.iloc[-1,3]
#             if not isinstance(made2,float) and not isinstance(attempted2,float):
#                 rows.append([year,team2,team1,best_of_series,i+1,num_games,int(made2),int(attempted2)])

# data = pd.DataFrame(rows,columns=['year','oteam','dteam','best_of_series','game_num',
#                                   'n_game_series','makes','attempts'])
In [3]:
## take a look at the data
# data.head()
In [4]:
## looking at the types of the columns
# data.dtypes
In [5]:
## store the data for later
# data.to_csv('playoff_games.csv',index=False)
In [6]:
## reading the data in and taking a look at the most recently played games
data = pd.read_csv('playoff_games.csv').sort_values('year').reset_index(drop=True)

data.tail(8)
Out[6]:
year oteam dteam best_of_series game_num n_game_series makes attempts
7453 2019 Portland Trail Blazers Denver Nuggets 4 7 7 38 93
7454 2019 Indiana Pacers Boston Celtics 4 3 4 34 81
7455 2019 Los Angeles Clippers Golden State Warriors 4 5 6 46 85
7456 2019 Golden State Warriors Los Angeles Clippers 4 5 6 43 96
7457 2019 Philadelphia 76ers Toronto Raptors 4 7 7 28 65
7458 2019 Milwaukee Bucks Detroit Pistons 4 1 4 44 90
7459 2019 Portland Trail Blazers Oklahoma City Thunder 4 4 5 37 90
7460 2019 Golden State Warriors Portland Trail Blazers 3 3 3 41 84

The way I have organized the data, each playoff game has two rows - one for each team's shooting performance. The row order got jumbled during the data collection process but, for example, for Game 4 of the Blazers vs. Thunder series, there are these two rows:

In [7]:
data.loc[(data['year']==2019) &
         (data['oteam'].str.strip().isin(['Portland Trail Blazers',
                                          'Oklahoma City Thunder'])) &
         (data['dteam'].str.strip().isin(['Portland Trail Blazers',
                                          'Oklahoma City Thunder'])) &
         (data['game_num']==4)]
Out[7]:
year oteam dteam best_of_series game_num n_game_series makes attempts
7351 2019 Oklahoma City Thunder Portland Trail Blazers 4 4 5 33 88
7459 2019 Portland Trail Blazers Oklahoma City Thunder 4 4 5 37 90

Astute readers might have noticed that, according to the data above, the Warriors-Blazers series this year is a need-3-wins-to-win series, which is in fact not true. It's an artifact of when I collected the data (May 19th, before Game 4 had been played). I am not going to worry about it.

To get at this question of whether shooting performance in the final games of series is worse than in the earlier games, let's first plot the distribution of shooting performance for each game number. In order to compare apples to apples, I have grouped the data and shown one plot for each possible length of a series. Yes, apparently there were some 1-game series way back when; these seem to have been some sort of tiebreaker to get into the playoffs.
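
For the curious, a quick filter on the assembled data pulls those 1-game series up (remember each game appears twice, once from each team's perspective):

In [ ]:
## inspect the 1-game "series"
data.loc[data['best_of_series']==1,
         ['year','oteam','dteam','game_num','n_game_series']]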

In [8]:
## store every single shooting percentage organized
## by series length and game number
results1 = defaultdict(dict)
sample_sizes = defaultdict(dict)
for series_length, sub in data.groupby('best_of_series'):
    for game_num, sub2 in sub.groupby('game_num'):
        results1[series_length][game_num] = (sub2.makes/sub2.attempts).values*100
        sample_sizes[series_length][game_num] = len(sub2)
In [9]:
## make a plot of the distributions using gaussian kde
## for estimating a probability density function
fig, ax = plt.subplots(2,2,figsize=(20,15))
fig.suptitle('Probability Density of Shooting Percentage',fontsize=24,y=.95)
ax = ax.flat

for i,series_length in enumerate(results1):
    ax[i].set_title('Wins Needed to Win Series: %d' % (series_length,),fontsize=18)
    ax[i].set_xlabel('Shooting Percentage',fontsize=13,labelpad=10)
    ax[i].set_ylabel('Probability Density',fontsize=13,labelpad=10)
    ax[i].tick_params(labelsize=14);
    
    kdes = {game_num:gaussian_kde(x) for game_num,x in results1[series_length].items()}
    dist_spaces = {game_num:np.linspace(min(x)-3,max(x)+3,200) for game_num,x in results1[series_length].items()}
    
    game_nums = results1[series_length].keys()
    for game_num in game_nums:
        ax[i].plot(dist_spaces[game_num],kdes[game_num](dist_spaces[game_num]),
                   label='Game %d, n=%d' % (game_num,sample_sizes[series_length][game_num]))
    ax[i].legend(loc=2,prop={'size':14});

plt.show();

Two things stick out to me from the above plots:

  • You can't see much difference between the distributions of the early games of series and the final games of series. Maybe, if you squint, the mass of the Game 7 distribution sits a little to the left of the masses of Games 1-6 in the final chart. However, it's difficult to discern whether there is a small but significant difference because all the distributions are so heavy-tailed.
  • It's really interesting to see that some of the distributions are bimodal. I was not expecting this, but it has a very natural explanation (most likely the mixing of different eras, since league-wide shooting percentages have shifted quite a bit since the 1950s). I won't bother to check whether that explanation is backed by evidence since we have more important things to get to!

What we'd really like to know is whether the expected (or mean) shooting percentage is appreciably different in these last games of series. So let's start by looking at the means, broken down by series length and game number as before:

In [10]:
## store means of shooting percentage organized
## by series length and game number
rows = []
for series_length, sub in data.groupby('best_of_series'):
    for game_num, sub2 in sub.groupby('game_num'):
        rows.append((series_length,game_num,
                     len(sub2),(sub2.makes/sub2.attempts).mean()*100))
    
pd.DataFrame(rows,columns=['series length','game number',
                           'number of games',
                           'mean of shooting percentage'])
Out[10]:
series length game number number of games mean of shooting percentage
0 1 1 4 38.260184
1 2 1 72 45.095347
2 2 2 72 45.601041
3 2 3 42 46.047981
4 3 1 336 46.302672
5 3 2 332 46.089285
6 3 3 332 45.821556
7 3 4 228 45.191655
8 3 5 116 45.111359
9 4 1 1042 45.187437
10 4 2 1045 45.399661
11 4 3 1052 44.787889
12 4 4 1046 45.115620
13 4 5 885 44.980855
14 4 6 593 44.569084
15 4 7 264 44.040490
A few observations:

  • For series where you have to win 2, the opposite of the hypothesis seems to be true: the shooting percentage for Game 3s is higher than that of Game 1s and Game 2s. However, the sample size is small, so it's difficult to draw any conclusion.
  • For series where you have to win 3, Game 5s do indeed have the lowest mean shooting percentage. But how can we know if it's an "appreciable" difference?
  • For series where you have to win 4, same conclusion: Game 7s do indeed have the lowest mean shooting percentage, but is the difference "appreciable"?

To answer the question of whether the difference in means is "appreciable", many people would turn to a t-test to get an idea of the likelihood of observing data like this if the two populations were the same (the null hypothesis).
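
For the curious, here is a quick sketch of what that might look like here, using scipy's two-sample t-test (Welch's version, which doesn't assume equal variances) to compare Game 7s against Game 1s in best-of-7 series:

In [ ]:
## sketch: Welch's t-test comparing Game 7 and Game 1
## shooting percentages in best-of-7 series
from scipy.stats import ttest_ind

best_of_7 = data[data['best_of_series']==4]
pct = best_of_7['makes']/best_of_7['attempts']*100
t_stat, p_value = ttest_ind(pct[best_of_7['game_num']==7],
                            pct[best_of_7['game_num']==1],
                            equal_var=False)
print('t = %.3f, p = %.4f' % (t_stat,p_value))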

Personally, I prefer an alternative called resampling (specifically, the bootstrap). It is a non-parametric method for comparing a statistic between two samples. The idea is to simulate different possible versions of each sample by recreating it using random sampling with replacement - sort of akin to the many-worlds theory. For each of the simulated samples, you compute the statistic you are interested in. What results is a distribution of that statistic, which gives a sense of its volatility for your data. What I really like about this approach is that, after you do this for each of your groups, you can visualize the results and essentially perform the test with your eyes.
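
In its simplest form, the idea looks something like this toy sketch (made-up numbers, just to illustrate the mechanics; the real thing on our data comes next):

In [ ]:
## toy illustration of resampling: bootstrap the mean
## of a fake sample of 100 shooting percentages
rng = np.random.RandomState(0)
fake_sample = rng.normal(45,5,size=100)

boot_means = [rng.choice(fake_sample,size=len(fake_sample),replace=True).mean()
              for _ in range(1000)]
print('sample mean: %.2f' % fake_sample.mean())
print('spread (std) of bootstrapped means: %.2f' % np.std(boot_means))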

Let's see how the results look for this case:

In [11]:
## store mean of shooting percentage
## for each of 1,000 simulated samples
## for each series length and game number
results2 = defaultdict(lambda: defaultdict(list))
sample_sizes = defaultdict(dict)
its = 1000
for series_length, sub in data.groupby('best_of_series'):
    for game_num, sub2 in sub.groupby('game_num'):
        sample_sizes[series_length][game_num] = len(sub2)
        for _ in range(its):
            sample = sub2.sample(frac=1.0,replace=True)
            results2[series_length][game_num].append((sample.makes/sample.attempts).mean()*100)
In [12]:
## make a plot of the distributions of means using gaussian kde
## for estimating a probability density function
fig, ax = plt.subplots(2,2,figsize=(20,15))
fig.suptitle('Probability Density of Mean of Shooting Percentage, Resampled',fontsize=24,y=.95)
ax = ax.flat

for i,series_length in enumerate(results2):
    ax[i].set_title('Wins Needed to Win Series: %d' % (series_length,),fontsize=18)
    ax[i].set_xlabel('Shooting Percentage',fontsize=13,labelpad=10)
    ax[i].set_ylabel('Probability Density',fontsize=13,labelpad=10)
    ax[i].tick_params(labelsize=14);
    
    kdes = {game_num:gaussian_kde(x) for game_num,x in results2[series_length].items()}
    dist_spaces = {game_num:np.linspace(min(x)-.5,max(x)+.5,200) for game_num,x in results2[series_length].items()}
    
    game_nums = results2[series_length].keys()
    for game_num in game_nums:
        ax[i].plot(dist_spaces[game_num],kdes[game_num](dist_spaces[game_num]),
                   label='Game %d, n=%d' % (game_num,sample_sizes[series_length][game_num]))
    ax[i].legend(loc=2,prop={'size':14});

plt.show();

Now we can really answer our question with a single picture!

Here are my personal takeaways from the picture:

  • For the best-of-3 game series (upper-right), we can see there's a large overlap between the distributions of means. This fits well with what we thought from before - the sample size is too small to really draw a conclusion for this set.
  • For the best-of-5 game series (bottom-left), we can see there is some decent differentiation between the earlier and the later games in the series. For example, the distribution for Game 5s overlaps only slightly with the distribution for Game 1s.
  • For the best-of-7 game series (bottom-right), the largest of the groups in terms of sample size, there is some appreciable difference between Game 7s and the earlier games. For example, the distribution for Game 7s barely overlaps at all with the distributions for Game 1s and Game 2s. There is also a bit of differentiation for Game 6s - it's always nice to see a dose-response relationship. (A quick way to put a number on this overlap is sketched right after this list.)
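
For those who want a number to go with the eyeball test, the resampled means in results2 make this easy. For example, the fraction of bootstrap pairs in which the Game 7 mean comes out below the Game 1 mean - values near 1 indicate the two distributions barely overlap:

In [ ]:
## quantify the overlap seen in the bottom-right plot:
## fraction of bootstrap pairs where the Game 7 mean
## is below the Game 1 mean (best-of-7, i.e. 4 wins needed)
g1 = np.array(results2[4][1])
g7 = np.array(results2[4][7])
print('P(Game 7 mean < Game 1 mean): %.3f' % np.mean(g7[:,None]<g1[None,:]))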

Overall, I think there is pretty solid (but not overwhelming) evidence that games later in series see poorer shooting percentages than games earlier in series. The announcers were right! As to why they are right - is it fatigue? defenses becoming more familiar with offenses? nerves? - I am not sure. Probably a mix of different factors.

Written on May 26, 2019