David Bear

Reputation: 97

Using Pandas to apply a groupby aggregate to the original data frame

I want to do something that seems pretty easy in a spreadsheet, but I can't figure out the syntax in pandas. I have a data set that can be grouped. I want to determine the aggregate stats for each of the groups, and then use those aggregates to create a new column back in the original data frame.

For example, if my data frame looks like this:

import pandas

d = pandas.DataFrame({'class': ['f1', 'f2', 'f3', 'f1'],
                      'user': ['jack', 'jen', 'joe', 'jan'],
                      'screen': [12, 23, 13, 15]})

Yes, it's much smaller than my real data set.

I would like to do something like

d['gp'] = d['screen'].apply(d.groupby('class').stdev())

and ensure that d.groupby().stdev() is actually the standard deviation for that row's class. In other words, I don't want the stdev for class f1 to be used when calculating gp for class f2, and so on.

My brain is thinking in spreadsheet mode, or in a Python for loop. I know there must be a simple pandas syntax to do this, but so far I haven't found anything in my searches that seems to fit my use case.

Upvotes: 3

Views: 3180

Answers (2)

David Bear

Reputation: 97

I'm working on this a little more and want to be a little more precise in defining what I want here. In my data set I have 3 groups of classes. I want to determine aggregate stats for each class; mean, std dev. So if I were doing this in pythonish pseudo code on a dict-list it would look something like this:

groupamean = mean(list of groupa['screens'])
groupastddev = stddev(list of groupa['screens'])

for p in groupa:
    x.append(p['screens'] * groupamean + groupastddev)

This would be repeated for each group. This is the way plain python leads me to think.

Pandas, with its data frame object, invites a new way to think. It's nice not to have to use for loops to operate over a series. But I don't know how to ensure that, when I apply the aggregate functions produced by groupby, each row gets the values for its own group.

The syntax that seems to get close is this

d['screengrade']= d['Screens Typed'].apply(lambda x: x / (classgroups.std + classgroups.mean) * 200 ) 

But this throws a TypeError.

Upvotes: 0

jezrael

Reputation: 862511

It seems you need transform, which returns a Series with the same length as the original DataFrame:

d['gp'] = d.groupby('class')['screen'].transform('std')
print (d)
  class  screen  user       gp
0    f1      12  jack  2.12132
1    f2      23   jen      NaN
2    f3      13   joe      NaN
3    f1      15   jan  2.12132

You get NaNs because some groups (f2, f3) contain only one row, and the sample standard deviation (the default, ddof=1) is undefined for a single value.
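The same transform approach can probably be extended to the per-class grade formula from the follow-up post. The following is only a sketch under a few assumptions: the column names class_mean, class_std and screengrade are made up for illustration, the formula is copied from the asker's attempt (x / (std + mean) * 200), and the population standard deviation (ddof=0) is swapped in so that single-row groups give 0.0 instead of NaN. Whether ddof=0 is appropriate depends on the real data.

```python
import pandas as pd

d = pd.DataFrame({
    'class': ['f1', 'f2', 'f3', 'f1'],
    'user': ['jack', 'jen', 'joe', 'jan'],
    'screen': [12, 23, 13, 15],
})

grouped = d.groupby('class')['screen']

# transform broadcasts each group's statistic back onto that group's rows,
# so every row only ever sees the mean/std of its own class.
d['class_mean'] = grouped.transform('mean')

# Assumption: ddof=0 (population std) so one-row groups yield 0.0, not NaN.
d['class_std'] = grouped.transform(lambda s: s.std(ddof=0))

# Formula taken from the asker's attempt; plain column arithmetic, no apply.
d['screengrade'] = d['screen'] / (d['class_std'] + d['class_mean']) * 200

print(d)
```

Because the helper columns are aligned to the original index, the final step is ordinary element-wise arithmetic between Series. The TypeError in the earlier attempt most likely comes from adding the methods classgroups.std and classgroups.mean without calling them.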

Upvotes: 6
