Python pandas groupby then filter row-wise and return a count

Question

I have a the following dataframe:

df = pd.DataFrame({
    'cluster': ['A','B','C','A','B','C','D','D'],
    'profit': [-1.0,1.5,1,0.5,3.0,-2,-1, -2]
    })

I am performing a series of groupby operations before outputting into another datafram, most of which I have got to work.

df['cluster_total_profit'] = df.groupby(['cluster'])['profit'].transform('sum')

df['cluster_mean_profit'] = df.groupby(['cluster'])['profit'].transform('mean')

df['occurances'] = df.groupby(['cluster'])['profit'].transform('count')

df['std'] = df.groupby(['cluster'])['profit'].transform('std')

clusters = df[['cluster','cluster_total_profit', 'cluster_mean_profit', 'occurances', 'std']].drop_duplicates().reset_index(drop=True)

The output is as follows:

  cluster  cluster_total_profit  cluster_mean_profit  occurances      std
0       A                  -0.5                -0.25           2  1.06066
1       B                   4.5                 2.25           2  1.06066
2       C                  -1.0                -0.50           2  2.12132
3       D                  -3.0                -1.50           2  0.707107

The last transformation I am trying to get to work is to count the number of profitable events in each group and populate the df with the number of such events. The the output could be collected in the table above which would look like the following:

  cluster  cluster_total_profit  cluster_mean_profit  occurances      std    profitable_events
0       A                  -0.5                -0.25           2  1.06066    1
1       B                   4.5                 2.25           2  1.06066    2
2       C                  -1.0                -0.50           2  2.12132    1
3       D                  -3.0                -1.50           2  0.707107   0

I have looked here and here but I can't quite get the examples to tranlate to my exact usecase. Here is my code:

df['profitable_events'] = df.cluster.map(df.groupby(['cluster']).filter(lambda x: x[x['profit'] > 0.0].count()))

clusters = df[['cluster','cluster_total_profit', 'cluster_mean_profit', 'occurances', 'std', 'profitable_events']].drop_duplicates().reset_index(drop=True)

and:

df['profitable_events'] = df.groupby(['cluster']).filter(lambda x: x[x['profit'] > 0.0]).transform('count')

Both throw an error "TypeError: filter function returned a Series, but expected a scalar bool"

I also tried:

df['profitable_events'] = df.cluster.map(df.groupby(['cluster']).filter(lambda x: len(x[x['profit'] > 0.0].index)))

Which errored: "TypeError: filter function returned a int, but expected a scalar bool"

I'm sure there is a quic fix but am not sure what it is?

Many thanks in advance

Quang Hoang · Accepted Answer

You can use a custom function to count the profitable event:

df.groupby('cluster')['profit'].agg([
    'sum','mean','count','std',
    ('profitable_event', lambda x: x.gt(0).sum())   
])

Output

         sum  mean  count       std  profitable_event
cluster                                              
A       -0.5 -0.25      2  1.060660               1.0
B        4.5  2.25      2  1.060660               2.0
C       -1.0 -0.50      2  2.121320               1.0
D       -3.0 -1.50      2  0.707107               0.0

Python pandas groupby then filter row-wise and return a count

Answers (1)

Related Questions