Reputation: 127
I have csv data:
Class, Code, Vendor, State, NumberOfDays
3, 123, Name1, NE, 12.58402778
1, 876, Name2, TX, 12.51041667
3, 123, Name1, NE, 2.354166667
1, 876, Name2, TX, 12.21111111
3, 456, Name2, NY, 6.346527778
2, 876, Name1, NY, 5.513194444
3, 123, Name1, NE, 5.38125
1, 876, Name2, TX, 5.409722222
And I have the follow code that gives me the mean
, min
, max
and median
of the NumberOfDays
field based on groupings of the other fields
df = pd.read_csv(r'C:...input.csv')
d = {'mean':'AverageDays','min':'LowestNumberOfDays','max':'HighestNumberOfDays','median':'MedianDays'}
df = (df.groupby(['Class','Code','Vendor','State'])['NumberOfDays']
.agg(['mean','min','max','median'])
.rename(columns=d)
.reset_index())
df.to_csv(r'C:...output.csv', index=False)
It does what I need perfectly:
Class,Code,Vendor,State,AverageDays,LowestNumberOfDays,HighestNumberOfDays,MedianDays
1,876,Name2,TX,10.043750000666666,5.409722222,12.51041667,12.21111111
2,876,Name1,NY,5.513194444,5.513194444,5.513194444,5.513194444
3,123,Name1,NE,6.773148149,2.354166667,12.58402778,5.38125
3,456,Name2,NY,6.346527778,6.346527778,6.346527778,6.346527778
When I add mode
to the code:
d = {'mean':'AverageDays','min':'LowestNumberOfDays','max':'HighestNumberOfDays','median':'MedianDays','mode':'ModeDays'}
df = (df.groupby(['Class','Code','Vendor','State'])['NumberOfDays']
.agg(['mean','min','max','median','mode'])
I get this errror:
AttributeError: Cannot access callable attribute 'mode' of 'SeriesGroupBy' objects, try using the 'apply' method
When I researched this I see that mode is not a natural function in pandas but my issue is I cannot figure out how to leverage the other ways to calculate mode, and incorporate it in this same step of the code since this step reduces my data set into condensed groups. So I cannot run it after and if I run it before I will need to apply the mode value repeatedly across all the rows of my data for each group. My data is already over 4 gig and 30 million records so I would like to avoid making it bigger.
Upvotes: 1
Views: 1424
Reputation: 323226
You still can do that
lambda x: x.mode()[0]
df.groupby(['Class',' Code',' Vendor',' State'])[' NumberOfDays'].agg(['mean','min','max','median',lambda x: x.mode()[0]])
Out[8]:
mean min max median \
Class Code Vendor State
1 876 Name2 TX 10.043750 5.409722 12.510417 12.211111
2 876 Name1 NY 5.513194 5.513194 5.513194 5.513194
3 123 Name1 NE 6.773148 2.354167 12.584028 5.381250
456 Name2 NY 6.346528 6.346528 6.346528 6.346528
<lambda>
Class Code Vendor State
1 876 Name2 TX 5.409722
2 876 Name1 NY 5.513194
3 123 Name1 NE 2.354167
456 Name2 NY 6.346528
Upvotes: 1
Reputation: 38415
You can use value_counts for mode
d = {'mean':'AverageDays','min':'LowestNumberOfDays','max':'HighestNumberOfDays','median':'MedianDays', '<lambda>': 'ModeDays'}
df.groupby(['Class','Code','Vendor','State'])['NumberOfDays']\
.agg(['mean','min','max','median', lambda x: x.value_counts().index[0]])\
.rename(columns=d)\
.reset_index()
Class Code Vendor State AverageDays LowestNumberOfDays HighestNumberOfDays MedianDays ModeDays
0 1 876 Name2 TX 10.043750 5.409722 12.510417 12.211111 12.510417
1 2 876 Name1 NY 5.513194 5.513194 5.513194 5.513194 5.513194
2 3 123 Name1 NE 6.773148 2.354167 12.584028 5.381250 2.354167
3 3 456 Name2 NY 6.346528 6.346528 6.346528 6.346528 6.346528
Upvotes: 2