daiyue
daiyue

Reputation: 7448

pandas groupby multiple columns did not sort values by default

I have the following df,

code      pct         year_month
10        6.6156      201905
10        6.0868      201905
10        5.8975      201905
10        11.2195     201905
10        11.1404     201905 

I like to do the following

df2 = df.sort_values('pct', ascending=False)
df2['pct'].cumsum().le(20).mean()
0.2

in a groupby way,

df.groupby(['year_month', 'code'])['pct'].apply(lambda x: x.cumsum().le(20).mean())

but the result is different,

year_month  code
201905      BR10    0.6

I thought groupby should sort pct in descending order by default, but it seems not, so I am wondering how to sort pct within in each year_month, code group first then do the cumsum;

Upvotes: 2

Views: 310

Answers (1)

jezrael
jezrael

Reputation: 862511

Your code is different, for same output need sorting by first 2 columns - ['year_month','code'] or with this sample data omit it (if already sorted):

print (df['pct'].cumsum().le(20).mean())
0.6

df2 = df.sort_values(['year_month','code'], ascending=False)
print (df2['pct'].cumsum().le(20).mean())
0.6

In groupby are sorting values in groupby - here ['year_month', 'code'], not pct:

df = df.groupby(['year_month', 'code'])['pct'].apply(lambda x: x.cumsum().le(20).mean())
print (df)
year_month  code
201905      10      0.6
Name: pct, dtype: float64

So for same output sorting here both by pct column and if necessary for prevent sorting by 'year_month', 'code'] is added sort=False:

df2 = df.sort_values('pct', ascending=False)
print (df2['pct'].cumsum().le(20).mean())
0.2

df = (df.sort_values(['pct'], ascending=False)
        .groupby(['year_month', 'code'], sort=False)['pct']
        .apply(lambda x: x.cumsum().le(20).mean()))
print (df)
year_month  code
201905      10      0.2
Name: pct, dtype: float64

Upvotes: 2

Related Questions