Alex P
Alex P

Reputation: 45

Select max row per group in pandas dataframe

I have a dataframe with multiple attributes, some are repeating. I want to select the rows based on the max value in one column - but return the row having that value (not the max of every column). How??

Here's a sample:

df = pd.DataFrame({'Owner': ['Bob', 'Jane', 'Amy',
                            'Steve','Kelly'],
                   'Make': ['Ford', 'Ford', 'Jeep',
                           'Ford','Jeep'],
                   'Model': ['Bronco', 'Bronco', 'Wrangler',
                            'Model T','Wrangler'],
                   'Max Speed': [80, 150, 69, 45, 72],
                  'Customer Rating': [90, 50, 91, 75, 99]})

this gives us:

enter image description here

I want the row having the max(customer rating) for each Make/Model. Like this: enter image description here

Note this is NOT the same as df.groupby(['Make','Model']).max()

--> How do I do this?

Upvotes: 1

Views: 233

Answers (2)

Corralien
Corralien

Reputation: 120449

A variation of your answer using idxmax:

>>> df.loc[df.groupby(['Make', 'Model'])['Customer Rating'].idxmax()]
   Owner  Make     Model  Max Speed  Customer Rating
0    Bob  Ford    Bronco         80               90
3  Steve  Ford   Model T         45               75
4  Kelly  Jeep  Wrangler         72               99

Another solution without groupby:

>>> df.sort_values('Customer Rating') \
      .drop_duplicates(['Make', 'Model'], keep='last') \
      .sort_index()

   Owner  Make     Model  Max Speed  Customer Rating
0    Bob  Ford    Bronco         80               90
3  Steve  Ford   Model T         45               75
4  Kelly  Jeep  Wrangler         72               99

Upvotes: 1

Alex P
Alex P

Reputation: 45

I found an answer! I'm leaving the question up in case anyone else didn't recognize it as well.

Check out this post: Select the max row per group - pandas performance issue

I couldn't tell from that post that it was in fact what I needed, but it is. I tried two of them successfully:

def using_rank(df):
mask = (df.groupby(['Make', 'Model'])['Customer Rating'].rank(method='first', ascending=False) == 1)
return df.loc[mask]
df2 = using_rank(df)
df2

returns:

enter image description here

this also worked fine:

def using_sort(df):
df = df.sort_values(by=['Customer Rating'], ascending=False, kind='mergesort')
return df.groupby(['Make', 'Model'], as_index=False).first()

Upvotes: 0

Related Questions