Reputation: 150
Apologies for the confusing title. It is explained better below. I currently have a pandas data frame that looks something like this:
user_id year grade_id
1 2005 47
1 2003 70
1 2004 70
2 2011 50
2 2003 43
2 2009 60
I want to group by the user_id and return the minimum year value based on the max value of grade_id. So the output for the above data frame would look like so:
user_id year grade_id
1 2003 70
2 2009 60
Is there a simple/elegant way to do this? I have tried things like the following:
tmp_df = df.groupby(["user_id", "year"])["grade_id"].agg(np.max)
However, this does not return the correct year. I have already checked out a few StackOverflow posts but none seem to be the same issue. Any help would be much appreciated.
Upvotes: 1
Views: 108
Reputation: 26
This answer will work well on large data frames.
df.sort_values('grade_id', ascending=False).drop_duplicates(['user_id'])
Hope this helps.
Upvotes: 1
Reputation: 591
This might work with your example
df.groupby(['user_id', 'year']).grade_id.max().groupby(level=0).head(1).reset_index(name='grade_id')
Upvotes: 0
Reputation: 18367
Given the information this should work for you! I couldn't find an 'easy' way to solve it, but this generates the expected output!
import pandas as pd
data = {'user_id':[1,1,1,2,2,2],'year':[2005,2003,2004,2011,2003,2009],'grade_id':[47,70,70,50,43,60]}
df = pd.DataFrame(data)
print(df)
final_df = df.groupby('user_id').agg({'grade_id':'max'}).reset_index().merge(df[['grade_id','year']],how='left').sort_values('year').drop_duplicates('user_id')
print(final_df)
Output:
user_id grade_id year
0 1 70 2003
2 2 60 2009
Upvotes: 1