Reputation: 187
From a dataframe like this
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['a'] * 20 + ['b'] * 20,
'col2': (['c'] * 10 + ['d'] * 10)*2,
'col3': ( ['x'] * 3 + ['y'] * 7)*4,
'col4': np.random.rand(40),
'col5': np.random.rand(40),
}
)
I want to obtain the value of 'col3' where 'col4' is maximum and, if there are two or more rows with the same 'col4' value, choose the one where 'col5' is minimum.
dg = df.groupby(('col1', 'col2'), sort=False)['col4'].max()
dh = df.groupby(('col1', 'col2'), sort=False)['col5'].min()
So far I have come up with this,
dg = df.groupby(('col1', 'col2'), sort=False)\
.agg({'col4':'max', 'col5':'min'})
but it gives me the maximum 'col4' and minimum 'col5' independently, whereas I want the 'col5' value to be the one corresponding to the maximum 'col4' (if there is only one 'col4' maximum row).
Additionally, I would want to have the 'col3' value correspondinig to the max 'col4'. I have this now:
df.loc[df.groupby(('col1', 'col2'), sort=False)['col4'].idxmax()]['col3'].reset_index()['col3']
which gives me the column I want, but when I put it in the new dataframe, I don't get what I expect, which would be the max 'col4' and min 'col5' and their corresponding 'col3' value:
dg['col3'] = df.loc[df.groupby(('col1', 'col2'), sort=False)['col4'].idxmax()]['col3'].reset_index()['col3']
Example:
import numpy as np
import pandas as pd
np.random.seed(2020)
df = pd.DataFrame({ 'col1': ['a'] * 10 + ['b'] * 10, 'col2': (['c'] * 5 + ['d'] * 5)*2, 'col3': ( ['x'] * 5 + ['y'] * 5)*2, 'col4': np.random.randint(5, size=20), 'col5': np.random.randint(5, size=20), } )
col1 col2 col3 col4 col5
0 a c x 0 4
1 a c x 0 1
2 a c x 3 1
3 a c x 3 2
4 a c x 3 1
5 a d y 3 2
6 a d y 0 4
7 a d y 0 4
8 a d y 0 2
9 a d y 0 3
10 b c x 2 4
11 b c x 1 1
12 b c x 3 4
13 b c x 3 1
14 b c x 2 3
15 b d y 3 2
16 b d y 0 0
17 b d y 4 1
18 b d y 4 1
19 b d y 0 2
The expected output would be:
col1 col2 col3 col4 col5
a c x 3 1
a c x 3 1
a d y 3 2
b c x 3 1
b d y 4 1
b d y 4 1
Upvotes: 2
Views: 92
Reputation: 863701
Use:
np.random.seed(2020)
df = pd.DataFrame({ 'col1': ['a'] * 10 + ['b'] * 10, '
col2': (['c'] * 5 + ['d'] * 5)*2,
'col3': ( ['x'] * 5 + ['y'] * 5)*2,
'col4': np.random.randint(5, size=20),
'col5': np.random.randint(5, size=20), } )
print (df)
col1 col2 col3 col4 col5
0 a c x 0 4
1 a c x 0 1
2 a c x 3 1
3 a c x 3 2
4 a c x 3 1
5 a d y 3 2
6 a d y 0 4
7 a d y 0 4
8 a d y 0 2
9 a d y 0 3
10 b c x 2 4
11 b c x 1 1
12 b c x 3 4
13 b c x 3 1
14 b c x 2 3
15 b d y 3 2
16 b d y 0 0
17 b d y 4 1
18 b d y 4 1
19 b d y 0 2
First are filtered all rows with boolean indexing
by maximal col4
by GroupBy.transform
and comparing with col4
with Series.eq
df1 = df[df.groupby(['col1', 'col2'])['col4'].transform('max').eq(df['col4'])]
print (df1)
col1 col2 col3 col4 col5
2 a c x 3 1
3 a c x 3 2
4 a c x 3 1
5 a d y 3 2
12 b c x 3 4
13 b c x 3 1
17 b d y 4 1
18 b d y 4 1
#if need only some columns filter by list
cols = ['col1','col2','col3', 'col4','col5']
mask = df1.groupby(['col1', 'col2'], sort=False)['col5'].transform('min').eq(df1['col5'])
df2 = df1.loc[mask, cols]
print (df2)
col1 col2 col3 col4 col5
2 a c x 3 1
4 a c x 3 1
5 a d y 3 2
13 b c x 3 1
17 b d y 4 1
18 b d y 4 1
Upvotes: 2