# Do Pro Golfers Feed Off of Their Playing Partners?

I was golfing the other day and having a rough round. Thinking about my experiences playing with different groups got me wondering about "feeding off" the play of my fellow group members, and I decided it would be fun to investigate whether the so-called "feed-off effect" is real. I was given access to PGA Tour data through the ShotLink Intelligence Program, so I decided to use it for an analysis of the pro game (data from my personal game is lacking).

This feed-off effect is often mentioned on telecasts and in conversations about the game, usually supported with anecdotes. This analysis intends to answer the basic question of whether there is statistically significant evidence that the effect exists. I will do this by checking whether the within-group variation in scoring is significantly smaller than the between-group variation during a given round. If the effect does exist, I will go on to investigate other questions that might be interesting, like: does it exist especially at certain tournaments? On certain courses? For certain players? During certain times? Under certain circumstances? Golf geeks rejoice!

```
## the standard imports
import pandas as pd
pd.options.display.max_columns = 999
import numpy as np
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
```

```
## reading in round level data
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    round_data = pd.concat([pd.read_csv('../data/rawdata/%dr.txt' % year,
                                        sep=';')
                            for year in range(2003, 2017)])
```

```
## taking a look at the dimensions
round_data.shape
```

```
## making the columns easier to work with
round_data.columns = [col.replace(' ', '_')
                      for col in round_data.columns]
```

```
## look at the first 3 rows
round_data.head(3)
```

Alright, that is a lot of columns! We probably only need a handful of them, so I will pick out the ones we need and delete the rest.

```
round_cols = ['Tournament_Year', 'Permanent_Tournament_#', 'Event_Name',
              'Course_#', 'Course_Name', 'Player_Number', 'Player_Name',
              'Round_Number', 'Tee_Time', 'Round_Score',
              'End_of_Round_Finish_Pos._(text)']
for col in round_data.columns:
    if col in round_cols:
        continue
    del round_data[col]
```

```
## new dimensions
round_data.shape
```

```
## first three rows
round_data.head(3)
```

OK, now there is a slight problem. We have the tee times for each player, but to know which group each player was in, we actually need to know which hole they started on as well. I will demonstrate this by doing a groupby. But first, we need to do a quality check on the *Tee_Time* field.

```
## value counts of 'earliest' tee times
round_data.Tee_Time.value_counts().sort_index().head(3)
```

```
## value counts of 'latest' tee times
round_data.Tee_Time.value_counts().sort_index().tail(3)
```

```
## what is up with the midnight tee times and the made-up time 99:99?
print(round_data[round_data['Tee_Time'] == '00:00']
      [['Tournament_Year', 'Event_Name']]
      .groupby(['Tournament_Year', 'Event_Name']).size())
print(round_data[round_data['Tee_Time'] == '99:99']
      [['Tournament_Year', 'Event_Name']]
      .groupby(['Tournament_Year', 'Event_Name']).size())
```

```
## it seems like there is something up with the Masters in 2003
round_data.loc[(round_data.Tournament_Year == 2003)
               & (round_data.Event_Name == 'Masters Tournament'),
               'Tee_Time'].value_counts()
```

```
## I will drop this data since it seems to be missing the Tee Times
round_data = round_data[~((round_data.Tournament_Year == 2003)
                          & (round_data.Event_Name == 'Masters Tournament'))]
```

```
## Now ready for groupby tee time
## groupby tee time, count number of players in each tee time
round_data.groupby(['Tournament_Year',
                    'Permanent_Tournament_#', 'Course_#',
                    'Round_Number', 'Tee_Time'])\
          .size().head().to_frame()\
          .rename(columns={0: 'Player_Count'})
```

PGA Tour groups never have more than 4 players (I'm almost positive of this, playoff rounds aside) and rarely have more than 3. Six players with the same tee time is actually two different groups: one going off hole 1 and another off hole 10. During the first two rounds of a tournament, going off both 1 and 10 - "split tees" - is common. It is much less frequent during the last two rounds, where it typically only happens when weather becomes a concern. I'll demonstrate this by grouping the previous data frame by Round_Number and taking the mean of the count variable.
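To make this concrete, here is a toy sketch (with made-up players, not the ShotLink data) of how six players sharing a tee time resolve into two threesomes once the starting hole is known:

```python
## toy sketch with made-up players: six players share the 8:00 tee time,
## but the starting hole splits them into two distinct threesomes
import pandas as pd

toy = pd.DataFrame({'Player': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'Tee_Time': ['08:00'] * 6,
                    'Hole_Started_On': [1, 1, 1, 10, 10, 10]})

## grouping on tee time alone lumps both groups together...
by_time = toy.groupby('Tee_Time').size()
## ...while adding the starting hole recovers the true groups
by_time_and_hole = toy.groupby(['Tee_Time', 'Hole_Started_On']).size()
```

Grouping on tee time alone counts 6 players; adding the starting hole gives the two threesomes we actually want.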

```
## groupby Round_Number take mean of players per tee time
round_data.groupby(['Tournament_Year', 'Permanent_Tournament_#',
                    'Course_#', 'Round_Number', 'Tee_Time']).size()\
          .reset_index().rename(columns={0: 'Count'})\
          .groupby('Round_Number', as_index=False)['Count']\
          .agg(['count', 'mean'])\
          .rename(columns={'count': 'number of tee times',
                           'mean': 'mean players per tee time'})
```

So we see that in rounds 3 and 4 there are rarely "split tees" since the number of players per tee time is closer to the number of players we expect to see per group - 2 or 3.

The tee times in rounds 6 and 7 belong to the Match Play Championship. That data is irrelevant to the investigation at hand and unusable besides, since there is no such thing as a Round Score in match play, so I will drop it now.

```
## checking out what events have data for round 6 or 7
round_data.loc[round_data.Round_Number > 5,
               ['Event_Name', 'Permanent_Tournament_#']]\
          .groupby(['Event_Name', 'Permanent_Tournament_#'])\
          .size().to_frame().rename(columns={0: 'Round_Count'})
```

```
## drop this data
round_data = round_data.loc[round_data['Permanent_Tournament_#']!=470]
```

The problem is that this table has no column recording which hole each player started on. Without that information, it is impossible to determine the groups, so we will need to find it in the hole-level table. I will import this data and we will have a look.

```
## reading in hole level data and making columns easier to work with
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    hole_data = pd.concat([pd.read_csv('../data/rawdata/%dh.txt' % year,
                                       sep=';')
                           for year in range(2003, 2017)])
hole_data.columns = [col.replace(' ', '_')
                     for col in hole_data.columns]
```

```
## dimensions of hole-level data
hole_data.shape
```

```
## first three rows of hole-level data
hole_data.head(3)
```

```
## renaming the Tournament_# field so it is compatible with the other table
hole_data = hole_data.rename(columns={'Permanent_#': 'Permanent_Tournament_#'})
## selecting only the columns that we'll need, deleting the rest
hole_cols = ['Tournament_Year', 'Permanent_Tournament_#',
             'Player_Name', 'Player_#', 'Course_#', 'Hole_#',
             'Round_#', 'Hole_Seq_#', 'Score']
for col in hole_data.columns:
    if col in hole_cols:
        continue
    del hole_data[col]
## removing the Match Play Championship
hole_data = hole_data[hole_data['Permanent_Tournament_#'] != 470]
## removing the 2003 Masters
hole_data = hole_data[~((hole_data.Tournament_Year == 2003) &
                        (hole_data['Permanent_Tournament_#'] == 14))]
```

```
hole_data.head(3)
```

This is the information we need. If we subset to the rows containing the first hole each golfer played, and build a dictionary whose keys are tuples uniquely identifying a player-round and whose values are the first hole played in that round, this will be easy to merge into the round data frame. I'll do this now, after verifying the quality of the *Hole_Seq_#* field.

```
## value counts of the sequence field
hole_data['Hole_Seq_#'].value_counts().sort_index()
```

I find this data strangely perfect. Don't some players withdraw in the middle of a round? I'll continue by checking the *Score* field.

```
## value counts of score
hole_data['Score'].value_counts().sort_index()
```

```
## who made a 16 on a hole?
hole_data.sort_values('Score',ascending=False).head(1)
```

```
## slight diversion: let's check out the value counts of Round_Score
round_data.Round_Score.value_counts().sort_index()
```

```
## who shot 58?
round_data.sort_values('Round_Score')\
          [['Player_Name', 'Tournament_Year',
            'Event_Name', 'Round_Score']].head(1)
```

```
## who shot 106??
round_data.sort_values('Round_Score', ascending=False)\
          [['Player_Name', 'Tournament_Year',
            'Event_Name', 'Round_Score']].head(1)
```

```
## subsetting to only the first hole of each round,
## dropping irrelevant fields, making the dictionary
hole_data = hole_data[hole_data['Hole_Seq_#'] == 1]
hole_data.drop(['Player_Name', 'Score', 'Hole_Seq_#'],
               axis=1, inplace=True)
hole_data.set_index(['Tournament_Year', 'Permanent_Tournament_#',
                     'Player_#', 'Course_#', 'Round_#'], inplace=True)
start_hole = hole_data.squeeze().to_dict()
```

```
## how many rounds started on each hole?
hole_data.squeeze().value_counts().to_frame().sort_index()
```

I'm not exactly sure what to make of the few abnormalities, with some rounds apparently starting on holes 9 and 11. I looked into it and found no evidence that these rounds actually started on those holes. I will leave this for now and see whether it causes problems later when I group by group and count the number of players.

I'll now insert a new column for hole started on which will be used together with tee time to determine the groups.

```
## inserting Hole Started On field
round_data['Hole_Started_On'] = [
    start_hole.get((year, tourn, player, course, round_), np.nan)
    for year, tourn, player, course, round_ in
    zip(round_data.Tournament_Year,
        round_data['Permanent_Tournament_#'],
        round_data['Player_Number'],
        round_data['Course_#'],
        round_data['Round_Number'])]
```

```
## see how many rounds weren't found in hole data
round_data.Hole_Started_On.value_counts(dropna=False)\
          .to_frame().sort_index()
```

```
## dropping the few rounds that were not matched
round_data = round_data.dropna(subset=['Hole_Started_On'])
```

Now we are ready to group by the same columns as before, plus *Hole_Started_On*, to see the number of players in each group.

```
## Now ready for groupby tee time + hole_started_on
round_data.groupby(['Tournament_Year', 'Permanent_Tournament_#',
                    'Course_#', 'Round_Number',
                    'Tee_Time', 'Hole_Started_On'])\
          .size().head().to_frame()\
          .rename(columns={0: 'Player_Count'})
```

```
## groupby Round_Number take mean of players per group
round_data.groupby(['Tournament_Year', 'Permanent_Tournament_#',
                    'Course_#', 'Round_Number',
                    'Tee_Time', 'Hole_Started_On']).size()\
          .reset_index().rename(columns={0: 'Count'})\
          .groupby('Round_Number', as_index=False)['Count']\
          .agg(['count', 'mean'])\
          .rename(columns={'count': 'number of groups',
                           'mean': 'mean players per group'})
```

These numbers look good. Let's check that there are no remaining abnormalities in the group sizes.

```
## value counts of sizes of groups
round_data.groupby(['Tournament_Year', 'Permanent_Tournament_#',
                    'Course_#', 'Round_Number',
                    'Tee_Time', 'Hole_Started_On']).size()\
          .reset_index().rename(columns={0: 'Group_Size'})\
          .Group_Size.value_counts().to_frame().sort_index()
```

I'm going to assume that groups of 4 or more players are mistakes. I'm almost positive that the groups of 5 or 6 result from incorrect *Hole_Started_On* values. I'm less confident that there were never foursomes, but these groups are so few that tossing them out will not affect the results, so I will do so. I also need to toss out all singles (groups of one player), since they are useless for this analysis.

```
## tossing out groups with more than 3 or fewer than 2 players
group_sizes = round_data.groupby(['Tournament_Year',
                                  'Permanent_Tournament_#',
                                  'Course_#', 'Round_Number',
                                  'Tee_Time', 'Hole_Started_On'])\
                        .size().to_dict()
to_keep_mask = [1 < group_sizes[(year, tourn, course, round_,
                                 teetime, holestart)] < 4
                for year, tourn, course, round_, teetime, holestart in
                zip(round_data.Tournament_Year,
                    round_data['Permanent_Tournament_#'],
                    round_data['Course_#'],
                    round_data['Round_Number'],
                    round_data['Tee_Time'],
                    round_data['Hole_Started_On'])]
round_data = round_data.loc[np.array(to_keep_mask)]
```

```
## value counts of sizes of groups (again)
round_data.groupby(['Tournament_Year', 'Permanent_Tournament_#',
                    'Course_#', 'Round_Number',
                    'Tee_Time', 'Hole_Started_On']).size()\
          .reset_index().rename(columns={0: 'Group_Size'})\
          .Group_Size.value_counts().to_frame().sort_index()
```

Now we are ready to do some statistics. Each group is identified by a 6-tuple.

### Stats

The question we want to answer is whether players within a group feed off each other. We want to measure some kind of success variable for each player and see whether the within-group variability is significantly smaller than the between-group variability. As is often the case, there are a few different ways of doing this. One is one-way ANOVA, whose goal is to test whether the means of two or more groups differ from one another. This classical statistical test has the following assumptions:

- The samples are independent.
- Each sample is from a normally distributed population.
- The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

(lifted from the scipy *f_oneway* documentation)
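As a quick sanity check of the machinery (on made-up numbers, not the Tour data), here is what `f_oneway` reports when three samples really do come from populations with different means:

```python
## synthetic sanity check: three samples drawn around different means
## should yield a large F-statistic and a tiny p-value
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=50)  # centered at 0
group_b = rng.normal(1.0, 1.0, size=50)  # centered at 1
group_c = rng.normal(2.0, 1.0, size=50)  # centered at 2

f_stat, p_value = f_oneway(group_a, group_b, group_c)
```

With means a full standard deviation apart, the p-value is effectively zero; the interesting question for our data is whether much subtler group differences still register.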

After thinking about these assumptions, it is apparent that with the simple gross score of each player as the success variable, the samples are not independent, because the groups are not made randomly. For example, during the third and fourth rounds of tournaments, the field is ordered by position and players are paired with the players in the most similar positions for the tournament. This measure is therefore problematic: you should expect players who have been doing well to continue doing well compared to players at the bottom of the field. In other words, a significant F-statistic from ANOVA on the gross score would not necessarily indicate the feed-off effect we are looking for, because the process by which the groups are made already implies a difference in means.

A better success variable would be the difference between what is expected of a player and how the player actually does. By controlling for expected performance, this new success variable should be independent for each group (or sample). For this we need some sort of model of expected performance. First, let's use Strokes Gained to the Field as the measure of success for a given round: the average *Round_Score* of all players in the round minus the player's own *Round_Score*. This lets us compare results across different rounds by controlling for the difficulty of the course and the conditions of each round (we'll ignore the possibility of field strength changing from tournament to tournament for simplicity). Let me start by adding the average score for each year-tournament-course-round combination as a column, along with Strokes Gained to the Field.
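As a toy illustration of the definition (made-up scores, not real data): if a four-man field averages 71, a 68 gains 3 strokes on the field and a 74 loses 3:

```python
## toy strokes-gained computation on a made-up four-player round
import pandas as pd

scores = pd.Series([68, 70, 72, 74])
field_ave = scores.mean()            # 71.0
strokes_gained = field_ave - scores  # 3, 1, -1, -3
```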

```
## inserting the field average score for each course-round combo
ave_scores = round_data.groupby(['Tournament_Year', 'Permanent_Tournament_#',
                                 'Course_#', 'Round_Number'])['Round_Score']\
                       .mean().to_dict()
round_data['Field_Ave'] = [ave_scores[(year, tourn, course, round_)]
                           for year, tourn, course, round_ in
                           zip(round_data.Tournament_Year,
                               round_data['Permanent_Tournament_#'],
                               round_data['Course_#'],
                               round_data['Round_Number'])]
## inserting Strokes Gained to the Field
round_data['Strokes_Gained'] = round_data.Field_Ave - round_data.Round_Score
```

```
## histogram for fun
round_data.Strokes_Gained.hist(bins=25,figsize=(15,5));
```

```
## first 3 rows of dataframe
round_data.head(3)
```

Now for our 'model'. Again for simplicity, I will just use an exponentially weighted moving average of past Strokes Gained to the Field as the golfer's expected Strokes Gained. This has the benefit of being intuitively appealing and easy to do in pandas, and the data is conveniently already in chronological order. I'll fit a half-life parameter to the data by choosing the value for which the prediction (the exponentially weighted moving average of past strokes-gained results) is most correlated with the outcome (strokes gained for the round). I won't worry about over-fitting the training data since it's just one parameter.
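Here is a minimal sketch of the mechanics on a made-up four-round strokes-gained sequence; the `shift(1)` is what keeps each round's prediction from peeking at that round's own result:

```python
## toy sketch of the shifted EWMA prediction for one player
import pandas as pd

sg = pd.Series([2.0, -1.0, 0.5, 3.0])  # made-up strokes gained by round

## shift(1) means each prediction uses only PRIOR rounds;
## ewm(halflife=2) weights recent rounds more heavily
pred = sg.shift(1).ewm(halflife=2).mean()
## pred: NaN (no prior data), then 2.0, then weighted averages of the past
```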

```
## recording the correlation of the ewma prediction with future results
results = []
for halflife in np.linspace(2, 80, 40):
    df = round_data[['Player_Number', 'Strokes_Gained']].copy()
    df['prediction'] = df.groupby('Player_Number')['Strokes_Gained']\
                         .transform(lambda x: x.shift(1)
                                               .ewm(halflife=halflife)
                                               .mean())
    df = df.dropna()
    results.append(np.corrcoef(df.Strokes_Gained, df.prediction)[0, 1])
```

```
## plotting halflife vs. correlation
fig, ax = plt.subplots(1, 1, figsize=(15, 7))
plt.plot(np.linspace(2, 80, 40), results, lw=3);
plt.ylim(0.17, 0.25);
plt.axvline(np.linspace(2, 80, 40)[np.argmax(results)],
            c='black', ls='--');
plt.xlabel('Halflife', fontsize=18);
plt.ylabel('Correlation with Future Strokes Gained', fontsize=18);
ax.tick_params(labelsize=16);
```

Now I'll insert this "model's" prediction into the dataframe as the player's expected Strokes Gained. I'll also insert the difference between his Strokes Gained performance for the day and what was expected of him. This gives us a more useful result variable.

```
## inserting the expected result and the difference between expected and actual
halflife = np.linspace(2, 80, 40)[np.argmax(results)]
round_data['Exp_StrokesGained'] = round_data.groupby('Player_Number')\
                                            ['Strokes_Gained']\
                                            .transform(lambda x: x.shift(1)
                                                                  .ewm(halflife=halflife)
                                                                  .mean())
## each player's first round is null because there is no prior data,
## so fill with 0, i.e., he is expected to tie the field
round_data.Exp_StrokesGained.fillna(0, inplace=True)
round_data['Net_Result'] = round_data['Strokes_Gained'] - round_data['Exp_StrokesGained']
```

```
## histogram of net result variable
round_data.Net_Result.hist(bins=20,figsize=(15,5));
```

```
## standard deviation before prediction and residuals
## shows the prediction is somewhat useful in explaining
## some variation
round_data.Strokes_Gained.std(),round_data.Net_Result.std()
```

Now we are all ready to do some statistical tests. We have a result variable and have made it so that our samples are independent by controlling for the overall skill of each player. We can try one-way ANOVA first. Let's make a nested list that we'll feed to the scipy ANOVA function.

```
## making a list of arrays for anova function
nested_list = []
for tup, df in round_data.groupby(['Tournament_Year',
                                   'Permanent_Tournament_#',
                                   'Course_#', 'Round_Number',
                                   'Tee_Time', 'Hole_Started_On']):
    nested_list.append(df.Net_Result.values)
```

```
## first sample
nested_list[0]
```

```
## making sure we have groups of 2 or 3 only
all(len(list_)==2 or len(list_)==3 for list_ in nested_list)
```

```
## import and feed scipy anova function
from scipy.stats import f_oneway
f_stat,p_value = f_oneway(*nested_list)
```

```
## look at results
f_stat,p_value
```

These are very strong results. I am cautious, however, since I am not sure that all three assumptions necessary for this test to be valid actually hold. I am confident about the independence assumption, and decently confident about the homoscedasticity assumption, but less confident about the normality assumption.

As an alternative, we can do the Kruskal-Wallis H-test. This tests the null hypothesis that the medians of the samples are the same. It is a non-parametric version of ANOVA and only assumes that the samples are independent of one another (which I believe is true). Let's see the results of this test.

```
## import and feed scipy kruskal-wallis function
from scipy.stats import kruskal
stat,p_value = kruskal(*nested_list)
```

```
## look at results
stat,p_value
```

Another strong result! This is very interesting, but the result is still not very tangible. To investigate the magnitude of the effect and visualize how robust it is, let's do a permutation test. First we define a closeness-within-a-group metric: the average distance between each possible pair of players within a group (1 pair for a twosome, 3 pairs for a threesome). Averaging this across all groups gives a single number telling us how close the groups are on average. First let's compute this metric for the real, unpermuted data.
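To make the metric concrete, a toy threesome (with made-up net results) works out like this:

```python
## toy sketch of the closeness metric for one made-up threesome
import numpy as np
from scipy.spatial.distance import pdist

group = np.array([-1.0, 0.0, 2.0])   # net results of a threesome

## pdist expects a 2-D array, hence the [:, None] reshape;
## the three pairwise distances are |-1-0|, |-1-2|, |0-2|
distances = pdist(group[:, None])    # array([1., 3., 2.])
group_closeness = distances.mean()   # (1 + 3 + 2) / 3 = 2.0
```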

```
## import pdist to compute pairwise distances
from scipy.spatial.distance import pdist
```

```
## compute metric for real data
metric_unpermuted = np.mean([np.mean(pdist(sample[:, None]))
                             for sample in nested_list])
```

```
## view the metric
metric_unpermuted
```

Now we will randomly permute the values of the *Net_Result* variable many times, computing the value of the metric of interest each time. This will show us how likely our result would have been if the groups had been assigned completely randomly, and let us visualize the metric computed on the real data in the context of random configurations of the groups.

Since, after a permutation, all that matters for defining the groups is the number of twosomes and the number of threesomes, I can avoid redoing the groupby explicitly. Instead, I will extract the values from the dataframe and use some handy numpy functions.

```
## define metric function
def compute_metric(samples):
    return np.mean([np.mean(pdist(sample[:, None]))
                    for sample in samples])
```

```
## extract the values and the number of twosomes and threesomes
result_values = round_data.Net_Result.values
counts = round_data.groupby(['Tournament_Year',
                             'Permanent_Tournament_#',
                             'Course_#', 'Round_Number',
                             'Tee_Time', 'Hole_Started_On']).size()\
                   .reset_index().rename(columns={0: 'Group_Size'})\
                   .Group_Size.value_counts()
num_twos, num_threes = counts[2], counts[3]
```

```
## view results
num_twos,num_threes
```

```
## indices to split the array at (no need for the last index)
indicies = np.cumsum(np.concatenate([[2] * num_twos,
                                     [3] * num_threes]))[:-1]
```

```
## permute 10,000 times and compute metric each time
metrics_permuted = []
for i in range(10000):
    result_values = np.random.permutation(result_values)
    samples = np.split(result_values, indicies)
    metrics_permuted.append(compute_metric(samples))
```

```
## make a fancy plot of the results
fig, ax = plt.subplots(1, 1, figsize=(16, 10))
plt.violinplot(metrics_permuted, vert=False);
plt.axvline(metric_unpermuted, lw=2.3,
            label='Value that Occurred in Reality');
plt.xlabel('Average of average pairwise distance within group',
           fontsize=20);
plt.ylabel('Density', fontsize=20);
plt.title('''Distribution of Metric Occurring by Chance
Along with Value that Occurred in Reality''', fontsize=23);
ax.tick_params(labelsize=16);
plt.legend(loc=2, prop={'size': 16});
```

The plot above shows that even the most extreme value on the low end of the distribution of metric values occurring by chance is about 5 standard deviations from the value produced by the real groups. This is another strong result.

### Conclusion

The permutation test, along with the statistical tests, allows me to confidently conclude that PGA Tour players display a feed-off effect: players in the same group perform more similarly to each other than one would expect by random chance. Although the size of the effect, measured as the average pairwise distance between all possible pairs within a group, is only about 0.06 strokes different from what one would expect by chance, the effect is astronomically unlikely to be attributable to chance.

I will end this post here and wait to explore questions such as the ones mentioned at the beginning for part two.

I will leave it to the reader to discern the psychological mechanisms behind this effect. In my mind there is a moral to this analysis: even from a purely competitive perspective, you should try to establish a good rapport with your playing partners, because what's good for them is going to be good for you - the scores of group members are tied together in a statistically significant way.