Public Data Munging - Exploring the Gender Wage Gap for Iowa State Employees
I recently listened to a podcast that talked about all the data that has been made available as a result of the open data initiative championed by the Obama Administration. Data.gov contains a lot of miscellaneous datasets. Most seem fairly uninteresting because they have been aggregated or summarized before being published. However, there are literally thousands of datasets - the quantity is overwhelming.
Since Trump is actually going to be president, there is no guarantee that this data will stay available. Therefore, carpe data! I pulled an interesting-looking dataset - the salary information on state employees of Iowa (link). After looking at the fields made available, I think it would be interesting to munge some salary data in order to explore the income gap by gender. How much money are women making in the workplace compared to their male counterparts? Are there particular disciplines where they are doing better or worse than others? Let's explore.
import pandas as pd
import numpy as np
from scipy.stats import spearmanr
import re
from collections import OrderedDict
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('./../data/Iowa_Salaries'+\
'/State_of_Iowa_Salary_Book.csv')
data.shape
data.head(3)
So, in accordance with the principle that all datasets have their quirks, the 'Base Salary' field, which is probably the most interesting field, is replaced with 'TERMINATED' if the employee is no longer active (which does not necessarily mean that the employee was fired). Why the base salary information has to be overwritten when an employee is no longer active will remain a mystery.
Let's explore the Base Salary field, along with the Total Salary Paid field to try to make some sense out of this information.
(data['Base Salary']=='TERMINATED').mean()
6 percent of the employees are no longer active
data['Total Salary Paid'].isnull().sum()
A handful of employees have no recorded pay
Since 'Total Salary Paid' depends on how long an employee was receiving pay and how many hours the employee logged, and this information is not in the dataset, it seems like 'Base Salary' will be a more useful source of information. Let's see what we're dealing with in this Base Salary field.
## Most frequently occurring values for Base Salary
data['Base Salary'][data['Base Salary']!='TERMINATED']\
.value_counts().head(12)
## Least frequently occurring values for Base Salary
data['Base Salary'][data['Base Salary']!='TERMINATED']\
.value_counts().tail(10)
Before we get to the fact that some values are missing a frequency of pay (hourly vs. weekly vs. monthly), notice that a bunch of employees appear to be making 0 dollars. Is this some sort of missing-value code, or are there actual employees making no money?
## Employees with Base Salary equal to '0'
data[data['Base Salary']=='0'].head(3)
It seems as though almost all the employees that have '0' for Base Salary are employed by the University of Iowa. Let's see which departments these employees work for in aggregate.
## Department Value Counts for employees with Base Salary 0
data[data['Base Salary']=='0'].Department.value_counts()
## Department Value Counts for employees with Base Salary -0-
data[data['Base Salary']=='-0-'].Department.value_counts()
## Department Value Counts for employees with Base Salary 0.00
data[data['Base Salary']=='0.00'].Department.value_counts()
It seems like many university employees have missing values for Base Salary. This is unfortunate. Are there any university employees with non-missing values for Base Salary?
## Department Value Counts for employees with non-missing Base Salary
data[~data['Base Salary'].isin(['0','-0-','0.00','$0.00',np.nan])]\
.Department.value_counts()
So we see that there are still plenty of university employees who have non-missing values for Base Salary. The universities are very large employers. I am personally surprised at how big an employer the Lottery Authority is. How many people does it really take to run a lottery?
Anyway, even though the Total Salary Paid field often tells us how much the people with null values for Base Salary were actually paid, that information is of limited use: we don't know how much time they worked, so we can't tell how valuable their time was, which is the more critical question.
I suppose that the total amount paid is also interesting in addition to the value of each employee's time, since a gender gap could present itself as a difference in the value of employees' time or as a difference in the number of hours the employer gives each employee. The latter is problematic, however, since one gender could be voluntarily choosing to work more or fewer hours. If we had more information, like the number of hours worked in addition to the number of hours desired on the part of the employee, then this sort of analysis would be possible. Since this info is not available, I will stick with looking at wages per unit of time.
Now let's see what the different values for frequency of pay are. The Base Salary field is currently a string, and it seems that when a frequency of pay is supplied, it appears as the final token of the string.
## match strings that start with a letter and not a digit
## (http://stackoverflow.com/a/1055173/5552894)
pattern = re.compile(r'[^\W\d]')
## grab the pay frequency tokens, which are alphabetic and sit
## at the end of the Base Salary strings
freqs = [salary.replace('/',' ').strip().split()[-1]
for salary in data['Base Salary'].dropna()
if salary.lower()!='terminated'
and bool(pattern.match(salary.replace('/',' ')\
.strip().split()[-1]))]
## value counts of frequencies
pd.Series(freqs).value_counts()
Ok, so I am guessing BW means base weekly, AN means annually, PW means per week, and BM means base month. Let's look at the median payment amount for each pay frequency to see if this interpretation makes sense.
## tries to coerce the string to a number, returning np.nan on failure
## works well for almost all cases
def convert_to_num(salary):
salary = salary.replace('$','').replace(', ','').\
replace(',','').replace('/','')
try:
return float(salary)
    except ValueError:
return np.nan
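To make its behavior concrete, here is roughly what the helper returns on a few made-up strings (illustrative inputs, not actual Base Salary values):
## illustrative, made-up inputs -- not actual Base Salary values
convert_to_num('$1,234.56')   ## -> 1234.56
convert_to_num('52000')       ## -> 52000.0
convert_to_num('N/A')         ## -> nan (float() fails, so np.nan is returned)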
## this data is actually fairly messy, the travesty below is the
## result of playing around with the different cases and it
## actually deals with almost all of the cases appropriately
tuples = [(salary.replace('/',' ').strip().split()[-1],
convert_to_num(''.join(salary.replace('/',' ').strip()\
.split()[:-1])))
if len(salary.replace('/',' ').strip().split())>1
else (salary.replace('/',' '),
convert_to_num(salary.replace('/',' ')))
for salary in data['Base Salary'].dropna()
if salary.lower()!='terminated'
and bool(pattern.match(salary.replace('/',' ').strip()\
.split()[-1]))]
freqs,amounts = zip(*tuples)
## group by frequency of pay,
## display number of entries and median pay amount
df = pd.DataFrame({'pay frequency':freqs,'pay amount':amounts})
df.dropna(subset=['pay amount']).groupby('pay frequency')\
.agg({'pay amount':['median','count']})
So the conjecture about what the frequency codes stand for seems alright. What about all those Base Salary entries that have no pay frequency specified? At first glance it seems like these indicate yearly salaries. Let's see if this is possible.
## get series of numbers that do not have frequency of pay specified
no_freq = pd.Series([convert_to_num(salary)
for salary in data['Base Salary'].dropna()
if not bool(pattern.match(salary\
.replace('/',' ')\
.strip().split()[-1]))
and salary not in ['0','$0.00','-0-','0.00']])
print 'Number of entries with no specified frequency: ',len(no_freq)
print 'Median pay amount of these entries: ',no_freq.median()
So it seems reasonable that these indicate yearly salaries as well. I will now add two columns to the original dataframe - one for pay amount and one for pay frequency.
## insert pay frequency and pay amount into the data frame
data.insert(len(data.columns),'pay amount',[0]*len(data))
data.insert(len(data.columns),'pay frequency',['']*len(data))
data.loc[data['Base Salary'].notnull(),'pay amount'] = \
[convert_to_num(''.join(salary.replace('/',' ')\
.strip().split()[:-1]))
if len(salary.replace('/',' ').strip().split())>1
else convert_to_num(salary.replace('/',' '))
for salary in data['Base Salary'].dropna()]
data.loc[data['Base Salary'].notnull(),'pay frequency'] = \
[salary.replace('/',' ').strip().split()[-1]
if len(salary.replace('/',' ').strip().split())>1
else 'YR' ## if no frequency specified, freq=YR
for salary in data['Base Salary'].dropna()]
## drop entries with null salaries
data = data.dropna(axis=0,subset=['Base Salary','pay amount'])
## drop entries with 0 for pay amount
data = data[data['pay amount']!=0]
Now we have the following pay frequency value counts:
data['pay frequency'].value_counts()
Let's combine the codes that we know mean the same thing and give them consistent names.
## map from old code to new code
mapper = {'BW':'WK','DA':'DAY','HRLY':'HR','AN':'YR','DAILY':'DAY',
'PW':'WK','BM':'MO'}
## converting codes
data['pay frequency'] = [mapper[freq] if freq in mapper else freq
for freq in data['pay frequency']]
## new pay frequency value count
data['pay frequency'].value_counts()
## dropping the employee who gets paid monthly
data = data[data['pay frequency']!='MO']
Now I will clean up the frame by dropping the columns that I have already ruled out or that are uninteresting for the question at hand.
data = data.drop(['Name','Base Salary','Base Salary Date',
'Total Salary Paid','Travel & Subsistence'],axis=1)
data.head()
We've already explored the Department field; let's explore some of the other ones to answer a few questions: How many nulls are in the Agency/Institution field? What are the value counts on Gender? What do the value counts on Place of Residence and Position look like?
## value counts and unique count for Agency/Institution Field
print data['Agency/Institution']\
.value_counts(dropna=False).head(10),'\n'
print 'Number of unique Agency/Institution values: ',\
pd.unique(data['Agency/Institution']).shape[0]
## value counts for Gender field
print data['Gender'].value_counts(dropna=False)
## dropping the 5 employees who are not recorded as Male or Female
## because there is not a big enough
## sample size to include them as a category
data = data[data.Gender.isin(['M','F'])]
## value counts and unique count for Place of Residence field
print data['Place of Residence']\
.value_counts(dropna=False).head(10),'\n'
print 'Number of unique Place of Residence values: ',\
pd.unique(data['Place of Residence']).shape[0]
## value counts and unique count for Position field
print data['Position']\
.value_counts(dropna=False).head(10),'\n'
print 'Number of unique Position values: ',\
pd.unique(data['Position']).shape[0]
Since there are so many nulls in the Agency/Institution field, I do not think it will be useful. Overall there are more female employees than male. The Place of Residence field seems to be a mishmash of cities and counties from what I can tell. This will be useful because wages depend on the cost of living, which can vary intrastate. There are a ton of unique values for Position because of the level of detail specified and some redundancy - Associate Professor vs. Assoc Professor, for example. Eliminating the redundancy would take a lot of work, so I will not bother with it for this simple, fun analysis.
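For the curious, a first pass at such a cleanup might look something like the hypothetical sketch below, which expands a few common abbreviations before comparing titles. The abbreviation map is a made-up starting point; a real cleanup would need far more entries plus manual review.
## hypothetical sketch of title normalization -- the abbreviation
## map below is illustrative and far from complete
abbrevs = {'assoc':'associate','asst':'assistant','prof':'professor'}
def normalize_title(title):
    return ' '.join(abbrevs.get(word.lower().strip('.'),word.lower())
                    for word in title.split())
## e.g. normalize_title('Assoc Professor') -> 'associate professor'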
In order to make the fairest comparison possible of employees' salaries, the employees being compared should be in the same department, hold the same position, live in the same place, and be paid at the same frequency.
They should be in the same department because rates of pay can vary from department to department. They should hold the same position for obvious reasons. They should live in the same place because, as said before, pay can vary from place to place with the cost of living. And finally, they should be paid at the same frequency. This last point is subtle. One could try to standardize hourly, weekly, and yearly salaries to a common unit like hourly pay, but to do this one must assume how many hours per year and per week each employee is working, which is not in the data. Making this assumption might seem harmless, but it could obscure the goal of comparing the pay of the two genders. For example, if one converts the hourly rate of pay to a yearly salary for positions where hours are scarce, this will overstate the true compensation of these employees. If one gender holds these hourly positions disproportionately, this will distort the comparison between genders. Likewise, if one does the opposite and converts yearly salaries to hourly rates by assuming the hours worked per year, this will overestimate the compensation of workers who hold positions where more hours or harder work are expected of them. If one gender disproportionately holds these sorts of positions, this too could confound the comparison of wages between genders.
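To make the distortion concrete, here is a toy example with made-up numbers: suppose a part-time hourly employee is paid $30/hour but only given 500 hours of work per year, while a salaried employee earns $50,000/year while actually working 2,500 hours.
## made-up numbers to illustrate how naive standardization distorts pay
hourly_rate, hours_given = 30.0, 500     ## part-time hourly employee
salary, hours_worked = 50000.0, 2500     ## overworked salaried employee
## annualizing the hourly rate at an assumed 2080 hours/year
print(hourly_rate*2080)   ## 62400.0 'annualized' vs. 15000.0 actually paid
## converting the salary to hourly at the same assumed 2080 hours/year
print(salary/2080)        ## ~24.04/hour assumed vs. 20.00/hour actually earned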
In looking at the question of the wage gap, the argument that males' careers are less frequently interrupted for the sake of raising a family is often given to explain away the gap. It is my hope that comparing only employees in this dataset who hold the exact same title and are paid at the same frequency will weaken such an argument, since as far as payroll is concerned, employees who hold the same position in the same department and get paid at the same frequency should be viewed as equally committed employees. Seniority would be another factor worth controlling for; unfortunately, it is not in the dataset. So we'll just do the best we can to control for as much as we can.
Let's look at the most frequently occurring (place, position, department, pay frequency) tuples in the dataset.
## Most frequently occurring (place, position,
## department, pay frequency) tuples
data.groupby(['Place of Residence','Position',
'Department','pay frequency'])\
.size().sort_values(ascending=False)\
.head().to_frame()\
.rename(columns={0:'Count'})
200 seems like a reasonable lower bound for sample size, so let's see how many of these tuples have more than 200 entries.
print 'Num. of (Pl,Pos,Dept,Freq) tuples with >200 employees',\
'in data:',\
(data.groupby(['Place of Residence','Position',
'Department','pay frequency'])\
.size()>200).sum()
I will make a further restriction that each gender have more than 60 representatives in any particular tuple, so that the sample size of either group is not too small. Let's see how many tuples remain.
## get set of tuples with more than 200 employees
tup_size = data.groupby(['Place of Residence','Position',
'Department','pay frequency']).size()
over_200 = set(tup_size[tup_size>200].index)
## get set of tuples with more than 60 of each gender
num_unique_genders = data.groupby(['Place of Residence','Position',
'Department','pay frequency'])\
['Gender'].apply(lambda x: len(pd.unique(x)))
min_freq_gender = data.groupby(['Place of Residence','Position',
'Department','pay frequency'])\
['Gender'].apply(lambda x:
x.value_counts().min())
over_60_each_gender = \
set(min_freq_gender.index[(num_unique_genders>1)\
& (min_freq_gender>60)])
valid_tuples = over_200.intersection(over_60_each_gender)
print 'Number of tuples with over 200 employees and more than 60',\
'in each gender: ', len(valid_tuples)
The plan is to compare the median salary of men and women for each of these 164 different (place, position, department, pay frequency) tuples to see how the wages of men and women compare. Before we can get to this very exciting part, I have to do one more munging step. The data represents wages over 9 years, and over those 9 years pay has been rising. In order to retain the decent sample sizes that I have for comparisons, I am going to adjust wages so that they are all in terms of 2007 dollars. I'll do this by using the change in median wages for each pay frequency and gender, and then adjusting all wages under the assumption that all jobs with the same pay frequency and gender change by the same proportion. This is not perfect, but it'll be fine for getting an idea of how the wages of the two genders compare for the different types of jobs, since the sample sizes are big enough that it should not throw off the comparison much.
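Concretely, for an employee with pay frequency $f$, gender $g$, and fiscal year $y$, the adjustment implemented below amounts to scaling each wage by the ratio of the group's 2007 median to its year-$y$ median:
$ \text{Pay}_{2007} = \text{Pay}_{y} \times \frac{\text{Median Wage}_{f,g,2007}}{\text{Median Wage}_{f,g,y}} = \text{Pay}_{y} - \frac{\text{Median Wage}_{f,g,y} - \text{Median Wage}_{f,g,2007}}{\text{Median Wage}_{f,g,y}} \times \text{Pay}_{y} $
The second form is the percent-change version that the code computes.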
While we are aggregating the median wage across time for each gender and each pay frequency, let's visualize the median wages across time.
## getting dataframe to be used to plot
unstacked = data.groupby(['Fiscal Year','pay frequency','Gender'])\
.median().unstack(0).T
unstacked.index = unstacked.index.droplevel(0)
## plotting median wage for each pay frequency for men and women
fig,ax = plt.subplots(2,2,figsize=(12,6),sharex=True)
axes = ax.flat
for u,freq in enumerate(['HR','DAY','WK','YR']):
    unstacked.xs(freq,axis=1).plot(ax=axes[u]);
axes[u].set_title(freq)
axes[u].ticklabel_format(useOffset=False)
axes[u].set_ylabel('Wage')
## store percent change so we can adjust wages all to 2007 levels
## (reusing the unstacked median-wage frame computed above)
pct_change = ((unstacked - unstacked.iloc[0])/unstacked*100)
pct_change = pct_change.to_dict()
## adjust the pay amount to 2007 dollars
def adjust_wage(x):
year,gender,freq = [x.iloc[0][col]
for col in ['Fiscal Year','Gender',
'pay frequency']]
x['pay amount 2007 dollars'] = x['pay amount'] - \
pct_change[(freq,gender)][year]*x['pay amount']/100
return x
## insert adjusted pay amount and apply adjusting function
data.insert(len(data.columns),'pay amount 2007 dollars',[0]*len(data))
data = data.groupby(['Fiscal Year','Gender','pay frequency'])\
.apply(lambda x: adjust_wage(x))
## top of new data frame
data.head(3)
## bottom of new data frame
data.tail(3)
It is interesting to see that, in aggregate, females' wages are significantly higher for employees who get paid hourly and daily, while much lower for employees who get paid yearly. The fact that the median wage for females paid yearly is lower than for males was expected, but I did not expect that in aggregate females' hourly pay would be higher than males'.
Also cool to see that the gap in median weekly pay has closed over the 9 years spanned by the dataset.
Now, the aggregate statistics, while interesting, do not really tell us anything specific about the fairness of wages. For example, what if skilled female workers take more hourly paying jobs rather than longer-commitment jobs because they do not plan to be in the work force for extended periods of time? This would mean that female hourly workers are more skilled in aggregate than male hourly workers and should thus command a higher wage. To really examine any gap in pay, we will use the (place, position, department, pay frequency) tuples saved earlier that have large enough samples of both genders, and see how the wages of the two genders compare. Of course this is not the complete story, since within one position, department, and pay frequency there could be different levels of responsibility or seniority that might rightly correspond with higher earnings. With the information available, however, this is the best we can do.
Within each identifying tuple, I am going to look at the difference in median pay between the males and the females of the group. I'll divide by the overall median of the group to obtain a standardized effect size. For each tuple, I'll compute this statistic and also a 95% confidence interval of the statistic that results from bootstrapped sampling of each subgroup (gender within a group).
$ \text{Pay Gap as a Percent} = \frac{\text{Median Wage of Males in Position} - \text{Median Wage of Females in Position}}{\text{Median Wage of Position}} \times 100 $
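For example, with made-up numbers: if the median male wage in a position were $52,000, the median female wage $50,000, and the overall median $51,000, the statistic would be (52000 - 50000)/51000 * 100, or about 3.9%.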
## bootstrap confidence interval function
def get_ci(x1,x2,its=5000):
stats = []
overall_median = np.median(np.concatenate([x1,x2]))
for _ in range(its):
x1_ = np.random.choice(x1,size=len(x1),replace=True)
x2_ = np.random.choice(x2,size=len(x2),replace=True)
stats.append((np.median(x1_) -
np.median(x2_))/overall_median * 100)
return (round(np.std(stats),2),
tuple(map(lambda x: round(x,2),
np.percentile(stats,q=[2.5,97.5]))))
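As a quick sanity check on the return format, here is a toy call on made-up normal samples (the numbers are arbitrary):
## toy example with made-up samples -- not values from the dataset
x1 = np.random.normal(52000,4000,size=100)   ## stand-in 'male' wages
x2 = np.random.normal(50000,4000,size=100)   ## stand-in 'female' wages
std,ci = get_ci(x1,x2,its=1000)
## std is the bootstrap standard deviation of the gap statistic;
## ci is the (2.5th, 97.5th) percentile interval, in percent of
## the pooled median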
## dict that will become dataframe
results = OrderedDict([('Department',[]),('Place of Residence',[]),
('Position',[]),('Pay Frequency',[]),
('Median Wage',[]),('Pay Gap as Percentage',[]),
('Num. of Males',[]),('Num. of Females',[]),
('Standard Deviation',[]),('95% C.I.',[])])
## for each tuple with enough of each gender,
## calculate and save statistics
for place,position,department,payfreq in valid_tuples:
subset = data[(data['Place of Residence']==place) &
(data.Position==position) &
(data.Department==department) &
(data['pay frequency']==payfreq)]
    ## compare the 2007-dollar adjusted wages so that pooling
    ## all nine years of data is fair
    male_pay = subset[subset.Gender=='M']\
                     ['pay amount 2007 dollars'].values
    female_pay = subset[subset.Gender=='F']\
                       ['pay amount 2007 dollars'].values
overall_median = np.median(np.concatenate([male_pay,female_pay]))
results['Place of Residence'].append(place)
results['Position'].append(position)
results['Department'].append(department)
results['Pay Frequency'].append(payfreq)
diff_medians = np.median(male_pay) - np.median(female_pay)
results['Pay Gap as Percentage'].append(round(diff_medians\
/overall_median*100,2))
std,ci = get_ci(male_pay,female_pay)
results['Standard Deviation'].append(std)
results['95% C.I.'].append(ci)
results['Median Wage'].append(int(overall_median))
results['Num. of Males'].append(len(male_pay))
results['Num. of Females'].append(len(female_pay))
I'll display the table ordered by the size of the effect, but I will order by the bound of the confidence interval nearer zero so that the more significant results appear higher. A negative Pay Gap indicates that the median income of women is higher than the median income of men, while a positive value indicates higher pay for men.
## make dataframe and sort it by the bound
## of the ci of the effect size
results_df = pd.DataFrame(results)
lower_bound_effect_size = [ci[0]
if gap>0 else -1*ci[1]
for ci,gap in
zip(results_df['95% C.I.'],
results_df['Pay Gap as Percentage'])]
results_df.iloc[np.argsort(lower_bound_effect_size)[::-1]]\
.reset_index(drop=True).head(20)
So 18 of the top 20 most significant effects are cases where men's wages are higher than women's. Overall, the jobs with the most significant pay gaps seem to be high paying ones - jobs at the university that require a high level of education.
There is one extreme outlier, the position of 'Adjunct Assistant Professor'. Surprisingly, the women employed under this title seem to have much higher earnings than their male counterparts. I am not sure if this is just an artifact of some strange hiring program. In the next cell I make a histogram showing the individual wages at this position for men and women. It seems that there is large variation in salary at this position and that men disproportionately have the lower salaries. Without more information I don't know what to make of this, so I'll just label it a curiosity.
## plot a histogram of the outlier position
## of Adjunct Assistant Professor
subset = data[(data.Department=='University of Iowa') &
(data['Place of Residence']=='JOHNSON') &
(data.Position=='Adjunct Assistant Professor') &
(data['pay frequency']=='YR')]
plt.figure(figsize=(10,5))
plt.hist([subset[subset.Gender=='M']['pay amount 2007 dollars'],
subset[subset.Gender=='F']['pay amount 2007 dollars']],
stacked=False,label=['Male','Female']);
plt.title("University of Iowa Adjunct Assistant Professors' Wages")
plt.ylabel('Frequency');plt.xlabel('Salary');
plt.legend();
Let's investigate the correlation between Median Wage and Pay Gap as a Percentage. We'll look at this for each Pay Frequency since it is inappropriate to compare wages for different pay frequencies. First here are some simple summary statistics of Pay Gap as a Percentage for each Pay Frequency.
## summarize Pay Gap as Percentage across different Pay Frequencies
results_df.groupby('Pay Frequency')\
.agg({'Pay Gap as Percentage':
[np.median,np.mean,'count']})
Since there is only 1 position to compare for daily wages and 4 for hourly wages, it is difficult to draw any conclusions for these pay frequencies. For weekly paid jobs, the median pay gap is one hundredth of a percent, while the mean is 1.8%. For yearly paid jobs, the median is 1.63% and the mean is 3.44%. The gap between the means and medians of these two distributions indicates that they are skewed.
Let's look at the correlation between Median Wage and Pay Gap as a Percentage. We'll use Spearman's rank correlation, which I like because it does not assume normality of the distributions (good, since the distribution of Pay Gap does not seem normal) and also because it's a really nice function in SciPy. I'll apply the Spearman correlation function to each Pay Frequency, and what will result is a tuple: (rank correlation coefficient, p-value).
## rank correlation coefficients between Median Wage and Pay Gap
## as a Percentage for each Pay Frequency
results_df.groupby('Pay Frequency')\
.apply(lambda x: spearmanr(x['Median Wage'],
x['Pay Gap as Percentage']))\
.to_frame().rename(columns={0:'(Correlation Coefficient,'+\
' p-value)'})
Median Wage and Pay Gap as a Percentage are positively correlated for each Pay Frequency. For weekly pay, although the correlation coefficient is positive, the p-value is not significant. For yearly pay, however, there is a very strong correlation between Median Wage and Pay Gap as a Percentage.
To graph this relationship, I will bin the median wage variable into 6 bins and display a chart of the median Pay Gap as a Percentage for each bin. I'll do this for yearly wages.
## matplotlib is amazing!
yearly = results_df[(results_df['Pay Frequency']=='YR')]
n_bins = 6
edges = np.histogram(yearly['Median Wage'].values,bins=n_bins)[1]
x = np.digitize(yearly['Median Wage'].values,edges)
fig, ax1 = plt.subplots(figsize=(15,10))
ax2 = ax1.twinx()
ax1.bar(range(1,n_bins+1),[yearly[x==i]['Pay Gap as Percentage']\
.median()
for i in range(1,n_bins+1)]
,width=1.0,label='Median Pay Gap as Percentage Inside Bin')
ax2.bar([_+.4 for _ in range(1,n_bins+1)],
[len(yearly[x==i]) for i in range(1,n_bins+1)],
width=.2,color='red',label='Number of Jobs in Bin')
plt.xticks(range(1,n_bins+2),["{:,}".format(int(edge))
                              for edge in edges])
ax1.tick_params(labelsize=22); ax2.tick_params(labelsize=22);
ax1.set_ylim(0,35); ax2.set_ylim(0,60);
ax1.yaxis.get_major_ticks()[0].set_visible(False);
ax1.legend(loc=2,prop={'size':18});
ax2.legend(loc=0,prop={'size':20});
ax1.set_xlabel('Median Wage Bin Edges',fontsize=24,labelpad=24);
ax1.set_ylabel('Median Pay Gap as Percentage',
fontsize=24,labelpad=18);
ax2.set_ylabel('Number of Jobs in Bin',rotation=-90,
fontsize=24,labelpad=32);
plt.title('Median Wage vs. Pay Gap as Percentage '+\
'for Yearly Paid Jobs',fontsize=26,y=1.03);
Conclusion
Females' wages for more frequently paying jobs (hourly or daily) are higher. To venture a guess: skilled women who do not want jobs that require more commitment, perhaps for the sake of raising a family, are not moving up to the typically more secure, less frequently paying jobs. The women who do take these more frequently paying jobs are then more skilled than their male counterparts and therefore command higher wages.
In the middle of the spectrum of frequency of pay, weekly paying jobs, there is very little gap between women's and men's wages.
At the least frequent end of the spectrum, yearly paying positions, there is a median Pay Gap of around 1.63%. The distribution of the Pay Gap is right-skewed, with some positions having very large gaps. Positions with large gaps tend to be the higher paying positions. Indeed, there is a very significant positive correlation between the median pay of yearly paying positions and the Pay Gap as a Percentage.