Public Data Munging - Exploring the Gender Wage Gap for Iowa State Employees


I recently listened to a podcast that talked about all the data that has been made available as a result of the open data intiative championed by the Obama Administration. contains a lot of random datasets. Most seem fairly uninteresting because they have been aggregated or summarized before being published. However, there are literally thousands of datasets - the quantity is overwhelming.

Since Trump is actually going to be president, there is no gaurentee that this data will stay available. Therefore, carpe data! I pulled an interesting looking data set - the salary information on state employees of Iowa (link). After looking at the fields made available, I think it would be interesting to munge some salary data in order to explore the income gap by gender. How much money are females making in the workplace compared to their male counterparts? Are there particular disciplines where they are doing better or worse than others? Let's explore.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import spearmanr
import re
from collections import OrderedDict
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
data = pd.read_csv('./../data/Iowa_Salaries'+\
In [3]:
(605925, 11)
In [4]:
Fiscal Year Department Agency/Institution Name Gender Place of Residence Position Base Salary Base Salary Date Total Salary Paid Travel & Subsistence
0 2007 Administrative Services, Department of NaN ABELS BEVERLY J F POLK ADVANCED PERSONNEL MGMT 31.62 HR 07/01/2007 12:00:00 AM $6462.34 NaN
1 2007 Administrative Services, Department of NaN ABRAMS JERRY A M WAPELLO EARLY OUT POSITION TERMINATED 07/01/2007 12:00:00 AM $4242.82 NaN
2 2007 Administrative Services, Department of NaN ADAMS CAROL L F MITCHELL ADVANCED PERSONNEL MGMT 2,212.80 BW 07/01/2007 12:00:00 AM $55065.60 $924.44

So, in accordance with the principle that all datasets have their quirks, the 'Base Salary' field, which is probably the most interesting field, is replace by 'TERMINATED' if the employee is no longer active (it actually does not necessarily mean that the employee was fired). Why the base salary information has to be overwritten when an employee is no longer active will remain a mystery.

Let's explore the Base Salary field, along with the Total Salary Paid field to try to make some sense out of this information.

In [5]:
(data['Base Salary']=='TERMINATED').mean()

6 percent of the employees are no longer active

In [6]:
data['Total Salary Paid'].isnull().sum()

A handful of employees have no recorded pay

Since 'Total Salary Paid' depends on how much time an employee was recieving pay/how many hours an employee logged, and this information is not in the dataset, it seems like 'Base Salary' will be a more useful source of information. Let's see what we're dealing with with this Base Salary field.

In [7]:
## Most frequently occuring values for Base Salary
data['Base Salary'][data['Base Salary']!='TERMINATED']\
0               18501
-0-             14655
0.00             8963
1,953.60 BW      2429
2,136.00 BW      2317
1,767.20 BW      1926
1,616.00 BW      1868
1,853.60 BW      1770
1,695.20 BW      1701
46583            1657
2,045.60 BW      1544
25,000.00 YR     1279
Name: Base Salary, dtype: int64
In [8]:
## Most infrequently occuring values for Base Salary
data['Base Salary'][data['Base Salary']!='TERMINATED']\
148,770.00    1
134,588.00    1
85078         1
132,604.00    1
49740         1
52,887.00     1
51,184.00     1
156,524.00    1
42,506.00     1
36,378.00     1
Name: Base Salary, dtype: int64

Before we get to the fact that some values are missing the frequency of pay (hourly vs. weekly vs. monthly), there are a bunch of employees making 0 dollars? Is this some sort of missing value code or are there actual employees making no money?

In [9]:
## Employees with Base Salary equal to '0'
data[data['Base Salary']=='0'].head(3)
Fiscal Year Department Agency/Institution Name Gender Place of Residence Position Base Salary Base Salary Date Total Salary Paid Travel & Subsistence
41955 2007 University of Iowa NaN ACHEPOHL,KEITH A M LANE Professor Emeritus 0 07/01/2007 12:00:00 AM $4872.22 $2762.51
41958 2007 University of Iowa NaN ACKELSON,SUSAN JANE F POLK Adjunct Instructor 0 07/01/2007 12:00:00 AM $1660.00 NaN
42000 2007 University of Iowa NaN ADAMSON,TIMOTHY L M HENRY Adjunct Assistant Professor 0 07/01/2007 12:00:00 AM $9600.00 NaN

It seems as though almost all the employees that have '0' for Base Salary are employed by the University of Iowa. Let's see who these employees are employed by in aggregate.

In [10]:
## Department Value Counts for employees with Base Salary 0
data[data['Base Salary']=='0'].Department.value_counts()
Iowa State University          11140
University of Iowa              6128
University of Northern Iowa     1233
Name: Department, dtype: int64
In [11]:
## Department Value Counts for employees with Base Salary -0-
data[data['Base Salary']=='-0-'].Department.value_counts()
Iowa State University          13245
University of Northern Iowa     1410
Name: Department, dtype: int64
In [12]:
## Department Value Counts for employees with Base Salary 0.00
data[data['Base Salary']=='0.00'].Department.value_counts()
University of Iowa             4585
Iowa State University          3908
University of Northern Iowa     470
Name: Department, dtype: int64

It seems like many university employees have missing values for base salary. This is unfortunate. Are there any University employees with not missing values for Base Salary?

In [13]:
## Department Value Counts for employees with non-missing Base Salary
data[~data['Base Salary'].isin(['0','-0-','0.00','$0.00',np.nan])]\
University of Iowa                                     168564
Human Services, Department of                           60487
Iowa State University                                   56718
Transportation, Department of                           35415
Corrections, Department of                              33024
University of Northern Iowa                             21040
Judicial Branch                                         20075
Natural Resources, Department of                        15973
Public Safety, Department of                            10030
Iowa Workforce Development                               9139
Iowa Veterans Home                                       8937
Education, Department of                                 8734
Inspections & Appeals, Department of                     6425
Legislative Branch                                       6201
Public Health, Department of                             5550
Public Defense, Department of                            4582
Agriculture & Land Stewardship, Department of            4146
Revenue, Department of                                   4112
Commerce, Department of                                  3814
Administrative Services, Department of                   3585
Regents, Board of                                        3246
Attorney General, Office of                              2791
Veterans Affairs, Department of                          2504
Administrative Services, Department of                   1348
Iowa Lottery Authority                                   1290
Auditor of State, Office of                              1264
Iowa Finance Authority                                   1114
Blind, Department for the                                1008
Iowa Communications Network                               962
Cultural Affairs, Department of                           958
IPERS                                                     860
Economic Development, Department of                       812
Iowa Economic Development Authority                       770
Human Rights, Department of                               602
Iowa Student College Aid Commission                       533
Governor, Office of                                       427
Secretary of State, Office of                             422
Iowa Civil Rights Commission                              419
Law Enforcement Academy                                   314
Treasurer of State, Office of                             309
Management, Department of                                 297
Aging, Department on                                      281
Chief Information Officer, Office of                      242
Parole, Board of                                          190
Elder Affairs, Department of                              122
Public Employment Relations Board                         117
Homeland Security & Emergency Management Department        95
Energy Independence, Office of                             85
Ethics & Campaign Disclosure Board                         82
Drug Control Policy, Office of                             71
Rebuild Iowa Office                                        24
Public Information Board                                   10
Governor-Elect, Office of                                   9
Name: Department, dtype: int64

So we see that there are still plenty of university employees that have non-missing values for Base Salary. The Universities are very large employers. I am personally surprised at how big of an employer the Lottery Authority is. How many people does it really take to run a lottery?

Anyway, even though we often know how much the people with null values for Base Salary were actually paid with the Total Salary Paid field, this information is not entirely informative because we don't know how much time they worked so we don't know how valuable their time was, which is the more critical question.

I suppose that the total amount paid is also interesting in addition to the value of each employee's time since a gender gap could present itself by a difference in the value of employees time, or in the amount of hours given to the employee by the employer. This however, is problematic since one gender could be choosing to work more or less hours voluntarily. If we had more information, like number of hours worked in addition to number of hours desired on the part of the employee, then this sort of analysis would be possible. However, since this info is not available, I will stick with looking at wages per unit of time.

Now let's see what the different values for frequency of pay are. The Base Salary field is currently a string and it seems to be the case that if there is a frequency of pay supplied, that it is the last two characters of the string.

In [14]:
## match all alphabet words but no numbers
## (
pattern = re.compile('[^\W\d]')
In [15]:
## get pay frequency which are all letters and at the 
## end of the Base Salary strings
freqs = [salary.replace('/',' ').strip().split()[-1]
         for salary in data['Base Salary'].dropna()
         if salary.lower()!='terminated' 
         and bool(pattern.match(salary.replace('/',' ')\
In [16]:
## value counts of frequencies
BW       204794
HR        20047
YR         7146
DA         2003
HRLY        106
AN           79
DAILY        64
LEAVE        45
PW            2
BM            1
dtype: int64

Ok, so I am guessing BW means base weekly, AN means anually, PW means per week and BM means base month. Let's look at the median payment amount for each pay frequency to see if this interpretation makes sense.

In [17]:
## tries to coerce the string to a number return np.nan if fails
## works well for almost all cases
def convert_to_num(salary):
    salary = salary.replace('$','').replace(', ','').\
        return float(salary)
        return np.nan
In [18]:
## this data is actually fairly messy, the travesty below is the 
## result of playing around with the different cases and it 
## actually deals with almost all of the cases appropriately
tuples = [(salary.replace('/',' ').strip().split()[-1],
           convert_to_num(''.join(salary.replace('/',' ').strip()\
          if len(salary.replace('/',' ').strip().split())>1
          else (salary.replace('/',' '),
                convert_to_num(salary.replace('/',' ')))
          for salary in data['Base Salary'].dropna()
          if salary.lower()!='terminated' 
          and bool(pattern.match(salary.replace('/',' ').strip()\

freqs,amounts = zip(*tuples)
In [19]:
## group by frequency of pay, 
## display number of entries and median pay amount
df = pd.DataFrame({'pay frequency':freqs,'pay amount':amounts})
df.dropna(subset=['pay amount']).groupby('pay frequency')\
        .agg({'pay amount':['median','count']})
pay amount
median count
pay frequency
AN 10000.000 79
BM 1363.370 1
BW 2019.200 204794
DA 103.920 2003
DAILY 149.000 64
HR 15.350 20046
HRLY 17.220 106
PW 3385.295 2
YR 37740.000 7145

So the conjecture about what the frequency codes stand for seems alright. What about all those Base Salary entires that have no pay frequency specified. At first glance it seems like these indicate yearly salaries. Let's see if this is possible.

In [20]:
## get series of numbers that do not have frequency of pay specified
no_freq = pd.Series([convert_to_num(salary)
                     for salary in data['Base Salary'].dropna()
                     if not bool(pattern.match(salary\
                                               .replace('/',' ')\
                     and salary not in ['0','$0.00','-0-','0.00']])
print 'Number of entries with no specified frequency: ',len(no_freq)
print 'Median pay amount of these entires: ',no_freq.median()
Number of entries with no specified frequency:  237019
Median pay amount of these entires:  52102.0

So it seems reasonable that these idicate yearly salaries as well. I will now add two columns to the original dataframe - one for pay amount and one for pay frequency.

In [21]:
## insert pay frequency and pay amount into the data frame
data.insert(len(data.columns),'pay amount',[0]*len(data))
data.insert(len(data.columns),'pay frequency',['']*len(data))
data.loc[data['Base Salary'].notnull(),'pay amount'] = \
            [convert_to_num(''.join(salary.replace('/',' ')\
             if len(salary.replace('/',' ').strip().split())>1
             else convert_to_num(salary.replace('/',' '))
             for salary in data['Base Salary'].dropna()]
data.loc[data['Base Salary'].notnull(),'pay frequency'] = \
            [salary.replace('/',' ').strip().split()[-1]
             if len(salary.replace('/',' ').strip().split())>1
             else 'YR' ## if no frequency specified, freq=YR
             for salary in data['Base Salary'].dropna()]
In [22]:
## drop entries with null salaries
data = data.dropna(axis=0,subset=['Base Salary','pay amount'])

## drop entries with 0 for pay amount
data = data[data['pay amount']!=0]

Now we have the following pay frequency value counts:

In [23]:
data['pay frequency'].value_counts()
YR       243658
BW       204794
HR        20046
DA         2003
HRLY        106
AN           79
DAILY        64
PW            2
BM            1
Name: pay frequency, dtype: int64

Let's combine the codes that we know are the same and name them in a fashion that is similiar to one another.

In [24]:
## map from old code to new code
mapper = {'BW':'WK','DA':'DAY','HRLY':'HR','AN':'YR','DAILY':'DAY',
In [25]:
## converting codes
data['pay frequency'] = [mapper[freq] if freq in mapper else freq
                         for freq in data['pay frequency']]
In [26]:
## new pay frequency value count
data['pay frequency'].value_counts()
YR     243737
WK     204796
HR      20152
DAY      2067
MO          1
Name: pay frequency, dtype: int64
In [27]:
## dropping the employee who gets paid monthly 
data = data[data['pay frequency']!='MO']

Now I will clean up the frame by dropping columns that I have already ruled out using or are uninteresting to the question at hand.

In [28]:
data = data.drop(['Name','Base Salary','Base Salary Date',
                  'Total Salary Paid','Travel & Subsistence'],axis=1)
In [29]:
Fiscal Year Department Agency/Institution Gender Place of Residence Position pay amount pay frequency
0 2007 Administrative Services, Department of NaN F POLK ADVANCED PERSONNEL MGMT 31.62 HR
2 2007 Administrative Services, Department of NaN F MITCHELL ADVANCED PERSONNEL MGMT 2212.80 WK
3 2007 Administrative Services, Department of NaN F POLK ACCOUNTING TECHNICIAN 2 1226.40 WK
8 2007 Administrative Services, Department of NaN F POLK INFO TECH SPECIALIST 2 2194.40 WK
12 2007 Administrative Services, Department of NaN M POLK EXEC OFF 4 3375.20 WK

We've already explored the Department field, let's explore some of the other ones to answer a couple questions like: how many nulls are in the Agency/Institution field?, what is the value counts on Gender?, what does the value counts on Place of Residence and Position look like?

In [30]:
## value counts and unique count for Agency/Institution Field
print data['Agency/Institution']\
print 'Number of unique Agency/Institution values: ',\
NaN                                             307506
Univ. Of Iowa Hospital & Clinics                 66037
Human Services Administration                    27252
Human Services Glenwood                           8566
Human Services Woodward                           6943
Corrections IA Medical Classification Center      5327
Corrections Fort Madison                          4534
Corrections Anamosa                               3263
Corrections Fort Dodge                            3089
Corrections Newton                                2897
Name: Agency/Institution, dtype: int64 

Number of unique Agency/Institution values:  37
In [31]:
## value counts for Gender field
print data['Gender'].value_counts(dropna=False)
F      262672
M      208075
NaN         3
FM          1
*           1
Name: Gender, dtype: int64
In [32]:
## dropping the 5 employees who are not recorded as Male or Female
## because there is not a big enough 
## sample size to inlcude them as a category
data = data[data.Gender.isin(['M','F'])]
In [33]:
## value counts and unique count for Place of Residence field
print data['Place of Residence']\
print 'Number of unique Place of Residence values: ',\
        pd.unique(data['Place of Residence']).shape[0]
JOHNSON       125704
POLK           65845
STORY          50825
BLACK HAWK     20380
LINN           19686
BOONE          11127
WASHINGTON     10498
MARSHALL       10030
DALLAS          7593
MILLS           6314
Name: Place of Residence, dtype: int64 

Number of unique Place of Residence values:  603
In [34]:
## value counts and unique count for Position field
print data['Position']\
print 'Number of unique Position values: ',\
Staff Nurse                 17283
Professor                    7140
Custodian I                  5773
HT ASSOCIATE                 5647
SOCIAL WORKER 2              5387
Clerk III                    4927
Associate Professor          4577
Name: Position, dtype: int64 

Number of unique Position values:  4654

Since there are so many nulls in the Agency/Institution field, I do not think it will be useful. Overall there are more female employees than males. The Place of Residency seems to be a mishmash of cities and counties from what I can tell. This will be useful because wages depend on the cost of living which can vary intrastate. There are a ton of unique values for Position because of the level of detail specified and some redundancy - Associate Professor and Assoc Professor. To eliminate the redundancy would take a lot of work so I will not bother with this for this simple fun analysis.

In order to make the fairest comparison possible of employees' salaries, the employees being compared should be in the same department, hold the same position, live in the same place, and be paid at the same frequency.

They should be in the same department because rates of pay can vary from department to department. They should hold the same position for obvious reasons. They should live in the same place because as said before, pay can vary from place to place because of the cost of living. And finally, they should be paid with the same frequency. This last point is subtle. One could try to standardize hourly, weekly and yearly salaries all to the same unit like hourly pay but to do this one must assume how many hours per year and per week each employee is working, which is not in the data. Making this assumption might seem harmless but it's possible that this could obscure the goal of comparing the pay of the two genders. For example, if one converts the hourly rate of pay to a yearly salary for positions where hours are scarce, this will misrepresent the true compensation of these employees. If one gender posesses these hourly positions disproportionately, this will distort the comparison between genders. Likewise, if one does the opposite and converts yearly salaries to hourly rates by assuming the hours worked per year, this will overestimate the compensation of workers who hold position where more hours or harder work is expected of them. If one gender disproportionately holds these sorts of positions, this too could confuse the comparison of wages between genders.

In looking at the question of the wage gap, often the argument that males' careers are less frequently interrupted for the sake of raising a family is given to explain away the wage gap. It is my hope that by only comparing employees in this dataset who hold the exact same title, and are paid with the same frequency will make such an argument less strong since as far as payroll is concerned, employees who hold the same position in the same department and get paid at the same frequency should be viewed as equally committed employees. Perhaps seniority would be another factor that it would be good to control for. Unfortunately, this is not in the dataset. So we'll just do the best we can to control for as much as we can.

Let's look at the most frequently occuring (place, position, department, pay frequency) tuples in the dataset.

In [35]:
## Most frequently occuring (place, position,
## department, pay frequency) tuples
data.groupby(['Place of Residence','Position',
              'Department','pay frequency'])\
Place of Residence Position Department pay frequency
JOHNSON Staff Nurse University of Iowa YR 9844
Professor University of Iowa YR 6883
Associate Professor University of Iowa YR 4321
STORY ASSOC PROF Iowa State University YR 3671
PROF Iowa State University YR 3658

200 seems like a reasonable lower bound for sample size, so let's see how many of these tuples have more that 200 entries.

In [36]:
print 'Num. of (Pl,Pos,Dept,Freq) tuples with >200 employees',\
      'in data:',\
      (data.groupby(['Place of Residence','Position',
                     'Department','pay frequency'])\
Num. of (Pl,Pos,Dept,Freq) tuples with >200 employees in data: 263

I will make a further restriction that the lower bound of representatives of each gender in any particular tuple is at least 60, so that the sample size of any group is not too small. Let's see how many tuples remain.

In [37]:
## get set of tuples with 200 or more employees
tup_size = data.groupby(['Place of Residence','Position',
                         'Department','pay frequency']).size()
over_200 = set(tup_size[tup_size>200].index)
In [38]:
## get set of tuples with at least 60 of each gender
num_unique_genders = data.groupby(['Place of Residence','Position',
                                'Department','pay frequency'])\
                        ['Gender'].apply(lambda x: len(pd.unique(x)))
min_freq_gender = data.groupby(['Place of Residence','Position',
                                'Department','pay frequency'])\
                        ['Gender'].apply(lambda x: 

over_60_each_gender = \
                                  & (min_freq_gender>60)])
In [39]:
valid_tuples = over_200.intersection(over_60_each_gender)
print 'Number of tuples with over 200 employees and at least 60',\
      'in each gender: ', len(valid_tuples)
Number of tuples with over 200 employees and at least 60 in each gender:  164

The plan is to compare the median salary between men and women for each of these 164 different place,position,departments,pay frequency tuples to see how the wages of men and women compare. Before we can get to this very exciting part, I have to do one more munging step. The data represents wages over 9 years, and over this 9 years pay has been rising. In order to retain the decent sample sizes that I have for comparisons, I am going to adjust wages so that they are all in terms of 2007 dollars. I'll do this by using the change in median wages for each pay frequency and gender and then adjusting all wages by assuming all jobs with the same pay frequency and gender are changing by the same proportion. This is not perfect, but it'll be fine for getting an idea of how the wages of the two genders compare for the different types of jobs since the sample size will be big enough that it should not throw off the comparison much.

While we are aggregating the median wage across time for each gender and each pay frequency, let's visualize the median wages across time.

In [40]:
## getting dataframe to be used to plot
unstacked = data.groupby(['Fiscal Year','pay frequency','Gender'])\
unstacked.index = unstacked.index.droplevel(0)
In [41]:
## plotting median wage for each pay frequency for men and women
fig,ax = plt.subplots(2,2,figsize=(12,6),sharex=True)
axes = ax.flat

for u,freq in enumerate(['HR','DAY','WK','YR']):
In [42]:
## store percent change so we can adjust wages all to 2007 levels
unstacked = data.groupby(['Fiscal Year','pay frequency','Gender'])\
unstacked.index = unstacked.index.droplevel(0)
pct_change = ((unstacked - unstacked.iloc[0])/unstacked*100)
pct_change = pct_change.to_dict()
In [43]:
## adjust the pay amount to 2007 dollars
def adjust_wage(x):
    year,gender,freq = [x.iloc[0][col] 
                        for col in ['Fiscal Year','Gender',
                                    'pay frequency']]
    x['pay amount 2007 dollars'] = x['pay amount'] - \
            pct_change[(freq,gender)][year]*x['pay amount']/100
    return x
In [44]:
## insert adjusted pay amount and apply adjusting function
data.insert(len(data.columns),'pay amount 2007 dollars',[0]*len(data))
data = data.groupby(['Fiscal Year','Gender','pay frequency'])\
                    .apply(lambda x: adjust_wage(x))
In [45]:
## top of new data frame
Fiscal Year Department Agency/Institution Gender Place of Residence Position pay amount pay frequency pay amount 2007 dollars
0 2007 Administrative Services, Department of NaN F POLK ADVANCED PERSONNEL MGMT 31.62 HR 31.62
2 2007 Administrative Services, Department of NaN F MITCHELL ADVANCED PERSONNEL MGMT 2212.80 WK 2212.80
3 2007 Administrative Services, Department of NaN F POLK ACCOUNTING TECHNICIAN 2 1226.40 WK 1226.40
In [46]:
## bottom of new data frame
Fiscal Year Department Agency/Institution Gender Place of Residence Position pay amount pay frequency pay amount 2007 dollars
605922 2016 Iowa Workforce Development NaN F POLK CLERK-ADVANCED 1460.8 WK 1109.038942
605923 2016 Iowa Workforce Development NaN M CLAY WORKFORCE ADVISOR 2169.6 WK 1837.842658
605924 2016 Iowa Workforce Development NaN F MARION WORKFORCE ADVISOR 2344.0 WK 1779.564130

It is interesting to see in aggregate that female's wages are significantly higher for employees who get paid hourly and daily, while much lower for employees who get paid yearly. The fact that median wages for females who get paid yearly is lower than males was expected for me but I did not expect that in aggregate females' hourly pay would be higher than males'.

Also cool to see that the gap in median weekly pay has closed over the 9 years spanned by the dataset.

Now, the aggregate statistics, while interesting, do not really tell us specific information in regards to fairness of wages. For example, what if skilled female workers end up taking more hourly paying jobs rather than longer commitment jobs because they do not plan to be in the work force for extended periods of time. This would mean that female hourly workers are more skilled in aggregate than male hourly workers and should thus command a higher wage. To really examine any gap in pay, we will use the tuples saved from before of places,positions,departments,pay frequency with large enough samples of both genders to see how the wages of the two genders compare. Of course this is not the complete story since within one position and department and pay frequency, there could be different levels of responsibility or seniority that might correspond with rightly higher earnings. However, with this informantion, this is the best we can do.

Within each identifying tuple, I am going to look at the difference in median pay of the males and the females of the group. I'll divide by the overall median of the group to obtain a standardized effect size. For each tuple, I'll compute this statistic and also a 95% confidence interval of the statistic that results from boostrapped sampling of each subgroup (gender within a group).

$ Pay \ Gap \ as \ a \ Percent = \frac{(Median \ Wage \ of \ Males \ in \ Position - Median \ Wage \ of \ Females \ in \ Position)}{Median \ Wage \ of \ Position} * 100 $

In [47]:
## bootstap confidence interval function
def get_ci(x1,x2,its=5000):
    stats = []
    overall_median = np.median(np.concatenate([x1,x2]))
    for _ in range(its):
        x1_ = np.random.choice(x1,size=len(x1),replace=True)
        x2_ = np.random.choice(x2,size=len(x2),replace=True)
        stats.append((np.median(x1_) - 
                      np.median(x2_))/overall_median * 100)
    return (round(np.std(stats),2),
            tuple(map(lambda x: round(x,2),
In [48]:
## dict that will become dataframe
results = OrderedDict([('Department',[]),('Place of Residence',[]),
                       ('Position',[]),('Pay Frequency',[]),
                       ('Median Wage',[]),('Pay Gap as Percentage',[]),
                       ('Num. of Males',[]),('Num. of Females',[]),
                       ('Standard Deviation',[]),('95% C.I.',[])])

## for each tuple with enough of each gender, 
## calculate and save statistics
for place,position,department,payfreq in valid_tuples:
    subset = data[(data['Place of Residence']==place) & 
                  (data.Position==position) &
                  (data.Department==department) & 
                  (data['pay frequency']==payfreq)]
    male_pay = subset[subset.Gender=='M']['pay amount'].values
    female_pay = subset[subset.Gender=='F']['pay amount'].values
    overall_median = np.median(np.concatenate([male_pay,female_pay]))
    results['Place of Residence'].append(place)
    results['Pay Frequency'].append(payfreq)
    diff_medians = np.median(male_pay) - np.median(female_pay)
    results['Pay Gap as Percentage'].append(round(diff_medians\
    std,ci = get_ci(male_pay,female_pay)
    results['Standard Deviation'].append(std)
    results['95% C.I.'].append(ci)
    results['Median Wage'].append(int(overall_median))
    results['Num. of Males'].append(len(male_pay))
    results['Num. of Females'].append(len(female_pay))

I'll display the table ordered by the size of the effect. But I will order by the lower bound of the confidence interval of the effect size in order to show more significant results higher. A negative Pay Gap indicates that the median income of women is more than the median income of men, while a positive indicates higher pay for men.

In [49]:
## make dataframe and sort it by the bound
## of the ci of the effect size
results_df = pd.DataFrame(results)
lower_bound_effect_size = [ci[0] 
                           if gap>0 else -1*ci[1] 
                           for ci,gap in 
                           zip(results_df['95% C.I.'],
                               results_df['Pay Gap as Percentage'])]
Department Place of Residence Position Pay Frequency Median Wage Pay Gap as Percentage Num. of Males Num. of Females Standard Deviation 95% C.I.
0 University of Iowa JOHNSON Adjunct Assistant Professor YR 30766 -107.25 114 95 25.66 (-148.15, -55.26)
1 University of Iowa JOHNSON Clinical Associate Professor YR 161834 36.30 841 722 4.29 (27.99, 44.01)
2 University of Iowa JOHNSON Clinical Assistant Professor YR 149350 29.13 1072 1230 2.27 (24.66, 33.48)
3 University of Iowa JOHNSON Associate Dean YR 175400 29.37 216 116 4.45 (22.05, 38.82)
4 University of Iowa JOHNSON Professor YR 141000 18.51 5454 1429 1.48 (15.97, 21.77)
5 University of Iowa JOHNSON Associate YR 109242 34.00 264 213 8.78 (15.87, 51.6)
6 University of Iowa JOHNSON Associate Professor YR 89001 16.98 2709 1612 1.04 (14.85, 18.86)
7 Legislative Branch POLK LEGISLATIVE EMPLOYEES-FU WK 2364 26.05 212 227 5.89 (14.23, 38.19)
8 University of Iowa JOHNSON Clinical Professor YR 190871 22.20 604 265 4.94 (13.96, 33.18)
9 University of Iowa JOHNSON Assistant Professor YR 85036 16.43 1916 1329 1.27 (13.76, 18.61)
10 Iowa State University STORY ASSOC PROF YR 82842 13.76 2365 1306 0.86 (12.02, 15.48)
11 University of Iowa JOHNSON Senior Imaging Technologist YR 56267 15.68 120 272 2.87 (11.15, 22.41)
12 University of Iowa JOHNSON Lecturer YR 49000 15.41 557 813 3.22 (10.82, 23.88)
13 Iowa State University STORY ASST PROF YR 73152 12.74 1419 1029 1.20 (10.16, 14.92)
14 University of Iowa JOHNSON Physician Assistant YR 90971 13.84 110 336 2.27 (9.44, 18.91)
15 Iowa State University STORY SENIOR LECTURER YR 46890 17.82 305 377 5.08 (8.85, 29.86)
16 University of Iowa JOHNSON Senior Physical Therapist YR 73656 17.80 77 160 4.53 (8.72, 25.42)
17 Iowa State University STORY PROF YR 112261 11.10 3004 654 1.32 (8.48, 13.47)
18 Iowa State University STORY ASSOC SCIENTIST YR 71640 10.53 386 136 1.77 (7.43, 14.4)
19 Human Services, Department of POTTAWATTAMIE RESIDENT TREATMENT WORKE WK 1506 -12.64 150 287 2.27 (-15.77, -7.33)

So 18 of the top 20 most significant effects are for men's wages being higher than womens'. Overall, the jobs with the strongest significance in pay gap seem to be high paying jobs - jobs at the university that require a high level of education.

There is one extreme outlier which is the position of 'Adjunct Assistant Professor'. Suprisingly, the women employed under this title seem to have much higher earnings than their male counterparts. I am not sure if this is just some sort of artifact of some strange hiring program. In the next cell I make a histogram showing just the individual wages at this position for men and women. It seems to be the case that there is a large variation in salary at this position and that men disproportionally have the lower salaries. Without more information I don't know what to make of this and I'll just label it a curiosity.

In [50]:
## plot a histogram of the outlier position
## of Adjunct Assitant Professor
subset = data[(data.Department=='University of Iowa') &
              (data['Place of Residence']=='JOHNSON') &
              (data.Position=='Adjunct Assistant Professor') &
              (data['pay frequency']=='YR')]
plt.hist([subset[subset.Gender=='M']['pay amount'],
          subset[subset.Gender=='F']['pay amount']],
plt.title("University of Iowa Adjunct Assitant Professors' Wages")

Let's investigate the correlation between Median Wage and Pay Gap as a Percentage. We'll look at this for each Pay Frequency since it is inappropriate to compare wages for different pay frequencies. First here are some simple summary statistics of Pay Gap as a Percentage for each Pay Frequency.

In [51]:
## summarize Pay Gap as Percentage across different Pay Frequencies
results_df.groupby('Pay Frequency')\
            .agg({'Pay Gap as Percentage':
Pay Gap as Percentage
median mean count
Pay Frequency
DAY -6.04 -6.040000 1
HR -3.14 -1.880000 4
WK 0.01 1.815435 46
YR 1.63 3.436991 113

Since there is only 1 position to compare for daily wages and 4 for hourly wages, it is difficulty to draw any conclusion from these pay frequencies. For weekly paid jobs, the median pay gap is one hundreth of a percent with the mean is 1.8%. For Yearly, the median is 1.63% and the mean is 3.44%. The gap between the means and medians of these two distributions indicates that they are skewed.

Let's look at the correlation between Median Wage and Pay Gap as a Percentage. We'll use Spearman's ranking correlation which I like to use because it does not assume normality of the distributions (good since the distribution of Pay Gap does not seem normal) and also since it's a really nice function in SciPy. I'll apply the Spearman correlation function to each Pay Frequency and what will result is a tuple - (rank correlation coefficient, pvalue).

In [52]:
## rank correlation coefficients between Median Wage and Pay Gap 
## as a Percentage for each Pay Frequency
results_df.groupby('Pay Frequency')\
          .apply(lambda x: spearmanr(x['Median Wage'],
                                     x['Pay Gap as Percentage']))\
          .to_frame().rename(columns={0:'(Correlation Coefficient,'+\
                                        ' p-value)'})
(Correlation Coefficient, p-value)
Pay Frequency
DAY (nan, nan)
HR (0.316227766017, 0.683772233983)
WK (0.0673557572346, 0.656491408751)
YR (0.35715830782, 0.000103039787606)

Median Wage and Pay Gap as a Percentage are positively correlated for each Pay Frequency. For weekly pay, although there is a positive correlation coefficient, the pvalue is not significant. For yearly pay however, there is a very strong correlation between Median Wage and Pay Gap as as Percentage.

To graph this relationship, I will bin the median wage variable into 6 bins and display a chart of the median Pay Gap as a Percentage for each bin. I'll do this for yearly wages.

In [53]:
## matplotlib is amazing!

yearly = results_df[(results_df['Pay Frequency']=='YR')]
n_bins = 6
edges = np.histogram(yearly['Median Wage'].values,bins=n_bins)[1]
x = np.digitize(yearly['Median Wage'].values,edges)
fig, ax1 = plt.subplots(figsize=(15,10))
ax2 = ax1.twinx(),n_bins+1),[yearly[x==i]['Pay Gap as Percentage']\
                           for i in range(1,n_bins+1)]
        ,width=1.0,label='Median Pay Gap as Percentage Inside Bin')[_+.4 for _ in range(1,n_bins+1)],
        [len(yearly[x==i]) for i in range(1,n_bins+1)],
        width=.2,color='red',label='Number of Jobs in Bin')
                              for edge in edges])
ax1.tick_params(labelsize=22); ax2.tick_params(labelsize=22);
ax1.set_ylim(0,35); ax2.set_ylim(0,60);
ax1.set_xlabel('Median Wage Bin Edges',fontsize=24,labelpad=24);
ax1.set_ylabel('Median Pay Gap as Percentage',
ax2.set_ylabel('Number of Jobs in Bin',rotation=-90,
plt.title('Median Wage vs. Pay Gap as Percentage '+\
          'for Yearly Paid Jobs',fontsize=26,y=1.03);


  1. Female' wages for more frequently (hourly or daily) paying jobs are higher. To venture a guess, skilled women who do not want jobs that require more commitment are not moving up to typically more secure, less frequently paying jobs for the sake of raising a family. The women who take these more frequently paying jobs are more skilled than their male counterparts and therefore command higher salaries.

  2. In the middle of the spectrum of frequency of pay, weekly paying jobs, there is very little gap between women's and men's wages.

  3. At the least frequent end of the spectrum, yearly paying positions, there is a median Pay Gap of around 1.63%. The distribution of Pay Gap is right-skewed with some positions having very large gaps. Positions with large gaps tend to be the higher paying positions. Indeed, there is a very significant positive correlation between median pay for yearly paying positions and the Pay Gap as a Percentage.

Written on January 1, 2017