Albert
Albert

Reputation: 399

Pandas groupby gives wrong values

I have written a data frame using pandas in python that consists of numbers and strings:

import pandas as pd
import numpy as np
d = {'col1': [1, 5,1,5,1,5,1,5], 'col2': [20,20,40,40,20,20,40,40],\
     'col3': np.arange(1,9)*0.1,\
     'col4':['Jen','Ross','Chan','Mon','Joe','Phebe','Janice','Gunter']}
df = pd.DataFrame(data=d)

The data frame looks like:

col1    col2    col3    col4
0   1   20      0.1     Jen
1   5   20      0.2     Ross
2   1   40      0.3     Chan
3   5   40      0.4     Mon
4   1   20      0.5     Joe
5   5   20      0.6     Phoebe
6   1   40      0.7     Janice
7   5   40      0.8     Gunther

Using groupby, I want to find the min value of col3 for each unique pair of (x,y) where x belongs to col1 and y belongs to col2. I also need to know what is the corresponding name in col4 that results in such a min. I did:

df2=df.groupby(['col1','col2']).min()
print(df2)

I got:

col1    col2    col3    col4        
1       20      0.1     Jen
        40      0.3     Chan
5       20      0.2     Phoebe
        40      0.4     Gunther

As you see, the last two rows are wrong. Looking at the pair of (5,20) in the table, the min value is 0.2 (correctly) but appears for ross (not for Phoebe). How can I fix this?

Upvotes: 1

Views: 1734

Answers (2)

piRSquared
piRSquared

Reputation: 294218

With nsmallest

IMO, use WeNYoBen's drop_duplicates answer.

The advantage of this answer is that it generalizes easily to take the n number of rows from each group.

See nsmallest
See nlargest

pd.concat(d.nsmallest(n=1, columns=['col3']) for _, d in df.groupby(['col1', 'col2']))

   col1  col2  col3  col4
0     1    20   0.1   Jen
2     1    40   0.3  Chan
1     5    20   0.2  Ross
3     5    40   0.4   Mon

Upvotes: 3

BENY
BENY

Reputation: 323226

We can use idxmin fix that

df.loc[df.groupby(['col1','col2'])['col3'].idxmin(),:]
   col1  col2  col3  col4
0     1    20   0.1   Jen
2     1    40   0.3  Chan
1     5    20   0.2  Ross
3     5    40   0.4   Mon

Or drop_duplicates

df.sort_values(['col3']).drop_duplicates(['col1','col2'])
   col1  col2  col3  col4
0     1    20   0.1   Jen
1     5    20   0.2  Ross
2     1    40   0.3  Chan
3     5    40   0.4   Mon

Upvotes: 5

Related Questions