Pylander

Reputation: 1601

PANDAS Group By with Multiple Functions Applied

I have a PANDAS dataframe with the following structure for example:

id,sex,age,rank,skill
1,M,9,1,A
1,M,8,2,G
1,M,10,3,F
2,F,10,3,M
2,F,8,4,W
2,F,6,4,O
3,M,5,1,Q
3,M,4,3,N
3,M,9,4,Y

Where my desired output after the groupby/apply operation to the dataframe is:

id,sex,age,rank,skill
1,M,8,1,A
2,F,6,3,M
3,M,4,1,Q

In other words, I want to group by the id field and get: the sex value (which does not change within a group), the min() of age, the min() of rank, and the skill value from the row where rank is at its min().

I understand that multiple agg functions can be passed to the groupby in a dict, but I do not understand how to handle values that are constant, or that depend on the result of a function applied to another field in the groupby.

Upvotes: 1

Views: 707

Answers (3)

kjmerf

Reputation: 4345

+1 for Wen.

Mine has a few more steps but it's the same idea and perhaps easier to read if you're not following:

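# Aggregation to apply to each column
func = {'sex': 'min', 'age': 'min', 'rank': 'min'}

df_agg = df.groupby('id').agg(func)
df_agg = df_agg.reset_index()

# Drop age by keyword; only the aggregated minimum in df_agg is needed
df = df.drop(columns='age')
df = pd.merge(df_agg, df, on=['id', 'sex', 'rank'])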

Set the aggregations you want to apply to each column. Then group by id, using agg. You need to reset the index at this point or else you won't be able to perform the merge in the next step, as id will be treated as the index.

df still stores your original data-frame. Drop age from df, as you'll only need the minimized age, stored in df_agg. Then perform the merge on the columns you'd expect to match: id, sex and rank. You are merging on rank to pull the correct skill along for the ride.
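
For anyone who wants to run the steps above end to end, here is a minimal sketch using the sample data from the question. The names df_agg and result are just illustrative, and it assumes the minimum rank is unique within each id (a tie at the minimum would give one output row per tied row).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "sex":   ["M", "M", "M", "F", "F", "F", "M", "M", "M"],
    "age":   [9, 8, 10, 10, 8, 6, 5, 4, 9],
    "rank":  [1, 2, 3, 3, 4, 4, 1, 3, 4],
    "skill": ["A", "G", "F", "M", "W", "O", "Q", "N", "Y"],
})

# Step 1: aggregate per id, then turn id back into a regular column
df_agg = (
    df.groupby("id")
      .agg({"sex": "min", "age": "min", "rank": "min"})
      .reset_index()
)

# Step 2: merge back on id, sex and rank to pick up the skill
# from the row that holds the minimum rank
result = pd.merge(df_agg, df.drop(columns="age"), on=["id", "sex", "rank"])

print(result)
```

The result should have one row per id with the columns id, sex, age, rank, skill.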

Upvotes: 1

BENY

Reputation: 323396

In your expected output it is the min of rank, but in your explanation you mentioned it is the max.

My answer is based on your expected output.

df.groupby(['id','sex'],as_index=False).agg({'age':'min','rank':'min'}).\
merge(df.drop(columns='age'),on=['id','sex','rank'],how='left')

Out[931]: 
   id sex  age  rank skill
0   1   M    8     1     A
1   2   F    6     3     M
2   3   M    4     1     Q

Upvotes: 3

user2285236

Reputation:

For columns that have constant values, you have several options: first, last, etc. For the skill value that corresponds to the highest (or minimum in your example) rank value, you need to use idxmin. idxmin returns the index label of the minimum, so for it to return the skill, skill has to be the index; as a first step, set it as the index.

df.set_index('skill').groupby('id').agg({'sex': 'first', 
                                         'age': 'min', 
                                         'rank': ['min', 'idxmin']})
Out: 
     sex age rank       
   first min  min idxmin
id                      
1      M   8    1      A
2      F   6    3      M
3      M   4    1      Q
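
If the two-level column header above is awkward to work with, one possible variation is named aggregation, which gives flat column names. This is only a sketch: it assumes pandas 0.25 or later, the output column names are my own choice, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "sex":   ["M", "M", "M", "F", "F", "F", "M", "M", "M"],
    "age":   [9, 8, 10, 10, 8, 6, 5, 4, 9],
    "rank":  [1, 2, 3, 3, 4, 4, 1, 3, 4],
    "skill": ["A", "G", "F", "M", "W", "O", "Q", "N", "Y"],
})

result = (
    df.set_index("skill")      # idxmin returns index labels, so index by skill
      .groupby("id")
      .agg(
          sex=("sex", "first"),      # constant within each id
          age=("age", "min"),
          rank=("rank", "min"),
          skill=("rank", "idxmin"),  # skill label at the minimum rank
      )
      .reset_index()
)

print(result)
```

Note that idxmin returns the first label when the minimum rank is tied within a group.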

Upvotes: 1
