Reputation: 47
I'm trying to find the wage gap between genders given a set of majors.
Here is a text version of my table:
gender field group logwage
0 male BUSINESS 7.229572
10 female BUSINESS 7.072464
1 male COMM/JOURN 7.108538
11 female COMM/JOURN 7.015018
2 male COMPSCI/STAT 7.340410
12 female COMPSCI/STAT 7.169401
3 male EDUCATION 6.888829
13 female EDUCATION 6.770255
4 male ENGINEERING 7.397082
14 female ENGINEERING 7.323996
5 male HUMANITIES 7.053048
15 female HUMANITIES 6.920830
6 male MEDICINE 7.319011
16 female MEDICINE 7.193518
17 female NATSCI 6.993337
7 male NATSCI 7.089232
18 female OTHER 6.881126
8 male OTHER 7.091698
9 male SOCSCI/PSYCH 7.197572
19 female SOCSCI/PSYCH 6.968322
diff hasn't worked for me, as it will take the difference between every consecutive major.
and here is the code as it is now:
for row in sorted_mfield:
if sorted_mfield['field group']==sorted_mfield['field group'].shift(1):
diff= lambda x: x[0]-x[1]
My next strategy would be to go back to the unsorted dataframe where male and female were their own columns and make a difference from there, but since I've spent an hour trying to do this, and am pretty new to pandas, I thought I would ask and find out how this works. Thanks.
Upvotes: 2
Views: 48
Reputation: 2019
Solution using Pandas.DataFrame.shift() in a sorted version of the data:
df.sort_values(by=['field group', 'gender'], inplace=True)
df['gap'] = df.logwage - df.logwage.shift(1)
df[df.gender =='male'][['field group', 'gap']]
Producing the following output with the sample data:
field group gap
0 BUSINESS 0.157108
2 COMM/JOURN 0.093520
4 COMPSCI/STAT 0.171009
6 EDUCATION 0.118574
8 ENGINEERING 0.073086
10 HUMANITIES 0.132218
12 MEDICINE 0.125493
15 NATSCI 0.095895
17 OTHER 0.210572
18 SOCSCI/PSYCH 0.229250
Note: it considers that you will always have a pair of values for each field group. If you want to validate it or eliminate field groups without this pair, the code below does the filtering:
df_grouped = df.groupby('field group')
df_filtered = df_grouped.filter(lambda x: len(x) == 2)
Upvotes: 2
Reputation: 59549
I'd consider reshaping your DataFrame with pivot
, making it easier to then compute.
df.pivot(index='field group', columns='gender', values='logwage').rename_axis([None], axis=1)
# female male
#field group
#BUSINESS 7.072464 7.229572
#COMM/JOURN 7.015018 7.108538
#COMPSCI/STAT 7.169401 7.340410
#EDUCATION 6.770255 6.888829
#ENGINEERING 7.323996 7.397082
#HUMANITIES 6.920830 7.053048
#MEDICINE 7.193518 7.319011
#NATSCI 6.993337 7.089232
#OTHER 6.881126 7.091698
#SOCSCI/PSYCH 6.968322 7.197572
df.male - df.female
#field group
#BUSINESS 0.157108
#COMM/JOURN 0.093520
#COMPSCI/STAT 0.171009
#EDUCATION 0.118574
#ENGINEERING 0.073086
#HUMANITIES 0.132218
#MEDICINE 0.125493
#NATSCI 0.095895
#OTHER 0.210572
#SOCSCI/PSYCH 0.229250
#dtype: float64
Upvotes: 2