Reputation: 97
I want to do something that seems pretty easy in a spreadsheet, but I can't figure out the syntax in pandas. I have a data set that can be grouped. I want to compute aggregate stats for each group, then use those aggregates to create a new column back in the original data frame.
For example, if my data frame looks like this:
import pandas
d = pandas.DataFrame({'class': ['f1', 'f2', 'f3', 'f1'],
                      'user': ['jack', 'jen', 'joe', 'jan'],
                      'screen': [12, 23, 13, 15]})
I would like to do something like
d['gp'] = d['screen'].apply(d.groupby('class').stdev())
and ensure that the d.groupby().stdev() is actually the stdev for that row's class. In other words, I don't want the stdev for class f1 to be used when calculating the gp for class f2, etc.
My brain is thinking in spreadsheet mode, or in a python for loop. I know there must be a simple pandas syntax to do this, but so far I haven't found anything in my searches that seems to fit my use case.
Upvotes: 3
Views: 3180
Reputation: 97
I'm working on this a little more and want to be a little more precise in defining what I want here. In my data set I have 3 groups of classes. I want to determine aggregate stats for each class; mean, std dev. So if I were doing this in pythonish pseudo code on a dict-list it would look something like this:
groupamean = mean(list of groupa['screens'])
groupastddev = stddev(list of groupa['screens'])
for p in groupa: x.append(p['screens'] * groupamean + groupastddev)
This would be repeated for each group. This is the way plain python leads me to think.
Pandas with the data frame object invites a new way to think. It's nice to not have to use for loops to do things over a series. But I don't know how to ensure that when I apply the aggregate functions produced by groupby, I get the correct groups.
The syntax that seems to get close is this:
d['screengrade']= d['Screens Typed'].apply(lambda x: x / (classgroups.std + classgroups.mean) * 200 )
But this throws a TypeError.
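The TypeError most likely comes from `classgroups.std` and `classgroups.mean` being written without parentheses: they are then bound methods rather than numbers, and two methods cannot be added. A minimal sketch reproducing this, assuming `classgroups` is the result of `d.groupby('class')` (its definition is not shown in the post) and reusing the small example frame from the question:

```python
import pandas as pd

d = pd.DataFrame({'class': ['f1', 'f2', 'f3', 'f1'],
                  'user': ['jack', 'jen', 'joe', 'jan'],
                  'screen': [12, 23, 13, 15]})

# Assumption: classgroups is a plain groupby object, as the name suggests.
classgroups = d.groupby('class')

raised = False
try:
    # Neither method is called, so this tries to add two bound methods.
    classgroups.std + classgroups.mean
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```

Calling the methods would remove this particular error, but the result would still be one value per group rather than one value per row, which is the alignment problem the question is about.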
Upvotes: 0
Reputation: 862511
It seems you need transform, which returns a Series with the same length as the original DataFrame:
d['gp'] = d.groupby('class')['screen'].transform('std')
print(d)
  class  screen  user       gp
0    f1      12  jack  2.12132
1    f2      23   jen      NaN
2    f3      13   joe      NaN
3    f1      15   jan  2.12132
You get NaNs because some groups (f2, f3) have length 1, and the sample standard deviation is undefined for a single value.
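The same approach should extend to the per-row formula from the follow-up post. The sketch below is only an illustration under a few assumptions: it uses the `screen` column of the example frame in place of `Screens Typed`, the names `group_mean`, `group_std` and `group_std_pop` are made up for this example, and it switches to the population standard deviation (`ddof=0`) so that single-row groups give 0.0 instead of NaN.

```python
import pandas as pd

d = pd.DataFrame({'class': ['f1', 'f2', 'f3', 'f1'],
                  'user': ['jack', 'jen', 'joe', 'jan'],
                  'screen': [12, 23, 13, 15]})

grouped = d.groupby('class')['screen']

# Each transform returns a Series aligned row by row with d.
group_mean = grouped.transform('mean')
group_std = grouped.transform('std')  # sample std (ddof=1): NaN for single-row groups

# Assumption: population std (ddof=0) is acceptable; it is 0.0 for single-row groups.
group_std_pop = grouped.transform(lambda s: s.std(ddof=0))

# Formula taken from the follow-up post: x / (std + mean) * 200
d['screengrade'] = d['screen'] / (group_std_pop + group_mean) * 200
print(d)
```

Because every intermediate Series shares the index of `d`, each row is combined only with the statistics of its own class. If the sample standard deviation is the one you want, keep `group_std` and decide separately how to treat the NaN rows, for example with `fillna`.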
Upvotes: 6