Reputation: 150

Grouping by a column based on the minimum year value returned from the max value of another column

Apologies for the confusing title. It is explained better below. I currently have a pandas data frame that looks something like this:

user_id year grade_id
1       2005 47
1       2003 70
1       2004 70
2       2011 50
2       2003 43
2       2009 60

I want to group by the user_id and return the minimum year value based on the max value of grade_id. So the output for the above data frame would look like so:

user_id year grade_id
1       2003 70
2       2009 60

Is there a simple/elegant way to do this? I have tried things like the following:

tmp_df = df.groupby(["user_id", "year"])["grade_id"].agg(np.max)

However, this does not return the correct year. I have already checked out a few StackOverflow posts but none seem to be the same issue. Any help would be much appreciated.

Upvotes: 1

Answers (3)

Michael Dowling

Reputation: 26

This answer will work well on large data frames.

df.sort_values('grade_id', ascending=False).drop_duplicates(['user_id'])

Hope this helps.

Upvotes: 1

ATL

Reputation: 591

This might work with your example

df.groupby(['user_id', 'year']).grade_id.max().groupby(level=0).head(1).reset_index(name='grade_id')

Upvotes: 0

Celius Stingher

Reputation: 18367

Given the information this should work for you! I couldn't find an 'easy' way to solve it, but this generates the expected output!

import pandas as pd
data = {'user_id':[1,1,1,2,2,2],'year':[2005,2003,2004,2011,2003,2009],'grade_id':[47,70,70,50,43,60]}
df = pd.DataFrame(data)
print(df)
final_df = df.groupby('user_id').agg({'grade_id':'max'}).reset_index().merge(df[['grade_id','year']],how='left').sort_values('year').drop_duplicates('user_id')
print(final_df)

Output:

   user_id  grade_id  year
0        1        70  2003
2        2        60  2009

Upvotes: 1

Grouping by a column based on the minimum year value returned from the max value of another column

Answers (3)

Related Questions