Reputation: 399
I have written a data frame using pandas in python that consists of numbers and strings:
import pandas as pd
import numpy as np
d = {'col1': [1, 5,1,5,1,5,1,5], 'col2': [20,20,40,40,20,20,40,40],\
'col3': np.arange(1,9)*0.1,\
'col4':['Jen','Ross','Chan','Mon','Joe','Phebe','Janice','Gunter']}
df = pd.DataFrame(data=d)
The data frame looks like:
col1 col2 col3 col4
0 1 20 0.1 Jen
1 5 20 0.2 Ross
2 1 40 0.3 Chan
3 5 40 0.4 Mon
4 1 20 0.5 Joe
5 5 20 0.6 Phoebe
6 1 40 0.7 Janice
7 5 40 0.8 Gunther
Using groupby, I want to find the min value of col3 for each unique pair of (x,y) where x belongs to col1 and y belongs to col2. I also need to know what is the corresponding name in col4 that results in such a min. I did:
df2=df.groupby(['col1','col2']).min()
print(df2)
I got:
col1 col2 col3 col4
1 20 0.1 Jen
40 0.3 Chan
5 20 0.2 Phoebe
40 0.4 Gunther
As you see, the last two rows are wrong. Looking at the pair of (5,20) in the table, the min value is 0.2 (correctly) but appears for ross (not for Phoebe). How can I fix this?
Upvotes: 1
Views: 1734
Reputation: 294218
nsmallest
IMO, use WeNYoBen's drop_duplicates
answer.
The advantage of this answer is that it generalizes easily to take the n
number of rows from each group.
pd.concat(d.nsmallest(n=1, columns=['col3']) for _, d in df.groupby(['col1', 'col2']))
col1 col2 col3 col4
0 1 20 0.1 Jen
2 1 40 0.3 Chan
1 5 20 0.2 Ross
3 5 40 0.4 Mon
Upvotes: 3
Reputation: 323226
We can use idxmin
fix that
df.loc[df.groupby(['col1','col2'])['col3'].idxmin(),:]
col1 col2 col3 col4
0 1 20 0.1 Jen
2 1 40 0.3 Chan
1 5 20 0.2 Ross
3 5 40 0.4 Mon
Or drop_duplicates
df.sort_values(['col3']).drop_duplicates(['col1','col2'])
col1 col2 col3 col4
0 1 20 0.1 Jen
1 5 20 0.2 Ross
2 1 40 0.3 Chan
3 5 40 0.4 Mon
Upvotes: 5