Reputation: 45
I have a dataframe with multiple attributes, some are repeating. I want to select the rows based on the max value in one column - but return the row having that value (not the max of every column). How??
Here's a sample:
df = pd.DataFrame({'Owner': ['Bob', 'Jane', 'Amy',
'Steve','Kelly'],
'Make': ['Ford', 'Ford', 'Jeep',
'Ford','Jeep'],
'Model': ['Bronco', 'Bronco', 'Wrangler',
'Model T','Wrangler'],
'Max Speed': [80, 150, 69, 45, 72],
'Customer Rating': [90, 50, 91, 75, 99]})
this gives us:
I want the row having the max(customer rating) for each Make/Model.
Like this:
Note this is NOT the same as df.groupby(['Make','Model']).max()
--> How do I do this?
Upvotes: 1
Views: 233
Reputation: 120449
A variation of your answer using idxmax
:
>>> df.loc[df.groupby(['Make', 'Model'])['Customer Rating'].idxmax()]
Owner Make Model Max Speed Customer Rating
0 Bob Ford Bronco 80 90
3 Steve Ford Model T 45 75
4 Kelly Jeep Wrangler 72 99
Another solution without groupby
:
>>> df.sort_values('Customer Rating') \
.drop_duplicates(['Make', 'Model'], keep='last') \
.sort_index()
Owner Make Model Max Speed Customer Rating
0 Bob Ford Bronco 80 90
3 Steve Ford Model T 45 75
4 Kelly Jeep Wrangler 72 99
Upvotes: 1
Reputation: 45
I found an answer! I'm leaving the question up in case anyone else didn't recognize it as well.
Check out this post: Select the max row per group - pandas performance issue
I couldn't tell from that post that it was in fact what I needed, but it is. I tried two of them successfully:
def using_rank(df):
mask = (df.groupby(['Make', 'Model'])['Customer Rating'].rank(method='first', ascending=False) == 1)
return df.loc[mask]
df2 = using_rank(df)
df2
returns:
this also worked fine:
def using_sort(df):
df = df.sort_values(by=['Customer Rating'], ascending=False, kind='mergesort')
return df.groupby(['Make', 'Model'], as_index=False).first()
Upvotes: 0