Get the value of a column where one column is max and another is min

Question

From a dataframe like this

import pandas as pd
import numpy as np

df = pd.DataFrame({
            'col1': ['a'] * 20 + ['b'] * 20,
            'col2': (['c'] * 10 + ['d'] * 10)*2,
            'col3': ( ['x'] * 3 + ['y'] * 7)*4,
            'col4': np.random.rand(40),
            'col5': np.random.rand(40),
            }
            )

I want to obtain the value of 'col3' where 'col4' is maximum and, if there are two or more rows with the same 'col4' value, choose the one where 'col5' is minimum.

dg = df.groupby(('col1', 'col2'), sort=False)['col4'].max()
dh = df.groupby(('col1', 'col2'), sort=False)['col5'].min()

So far I have come up with this,

dg = df.groupby(('col1', 'col2'), sort=False)\
        .agg({'col4':'max', 'col5':'min'})

but it gives me the maximum 'col4' and minimum 'col5' independently, whereas I want the 'col5' value to be the one corresponding to the maximum 'col4' (if there is only one 'col4' maximum row).

Additionally, I would want to have the 'col3' value correspondinig to the max 'col4'. I have this now:

df.loc[df.groupby(('col1', 'col2'), sort=False)['col4'].idxmax()]['col3'].reset_index()['col3']

which gives me the column I want, but when I put it in the new dataframe, I don't get what I expect, which would be the max 'col4' and min 'col5' and their corresponding 'col3' value:

dg['col3'] = df.loc[df.groupby(('col1', 'col2'), sort=False)['col4'].idxmax()]['col3'].reset_index()['col3']

Example:

import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame({ 'col1': ['a'] * 10 + ['b'] * 10, 'col2': (['c'] * 5 + ['d'] * 5)*2, 'col3': ( ['x'] * 5 + ['y'] * 5)*2, 'col4': np.random.randint(5, size=20), 'col5': np.random.randint(5, size=20), } )

   col1 col2 col3  col4  col5
0     a    c    x     0     4
1     a    c    x     0     1
2     a    c    x     3     1
3     a    c    x     3     2
4     a    c    x     3     1
5     a    d    y     3     2
6     a    d    y     0     4
7     a    d    y     0     4
8     a    d    y     0     2
9     a    d    y     0     3
10    b    c    x     2     4
11    b    c    x     1     1
12    b    c    x     3     4
13    b    c    x     3     1
14    b    c    x     2     3
15    b    d    y     3     2
16    b    d    y     0     0
17    b    d    y     4     1
18    b    d    y     4     1
19    b    d    y     0     2

The expected output would be:

   col1 col2 col3  col4  col5
     a    c    x     3     1
     a    c    x     3     1
     a    d    y     3     2
     b    c    x     3     1
     b    d    y     4     1
     b    d    y     4     1

jezrael · Accepted Answer

Use:

np.random.seed(2020)
df = pd.DataFrame({ 'col1': ['a'] * 10 + ['b'] * 10, '
                   col2': (['c'] * 5 + ['d'] * 5)*2, 
                   'col3': ( ['x'] * 5 + ['y'] * 5)*2, 
                   'col4': np.random.randint(5, size=20), 
                   'col5': np.random.randint(5, size=20), } )
print (df)
   col1 col2 col3  col4  col5
0     a    c    x     0     4
1     a    c    x     0     1
2     a    c    x     3     1
3     a    c    x     3     2
4     a    c    x     3     1
5     a    d    y     3     2
6     a    d    y     0     4
7     a    d    y     0     4
8     a    d    y     0     2
9     a    d    y     0     3
10    b    c    x     2     4
11    b    c    x     1     1
12    b    c    x     3     4
13    b    c    x     3     1
14    b    c    x     2     3
15    b    d    y     3     2
16    b    d    y     0     0
17    b    d    y     4     1
18    b    d    y     4     1
19    b    d    y     0     2

First are filtered all rows with boolean indexing by maximal col4 by GroupBy.transform and comparing with col4 with Series.eq

df1 = df[df.groupby(['col1', 'col2'])['col4'].transform('max').eq(df['col4'])]
print (df1)
   col1 col2 col3  col4  col5
2     a    c    x     3     1
3     a    c    x     3     2
4     a    c    x     3     1
5     a    d    y     3     2
12    b    c    x     3     4
13    b    c    x     3     1
17    b    d    y     4     1
18    b    d    y     4     1

#if need only some columns filter by list
cols = ['col1','col2','col3', 'col4','col5']
mask = df1.groupby(['col1', 'col2'], sort=False)['col5'].transform('min').eq(df1['col5'])
df2 = df1.loc[mask, cols]
print (df2)
   col1 col2 col3  col4  col5
2     a    c    x     3     1
4     a    c    x     3     1
5     a    d    y     3     2
13    b    c    x     3     1
17    b    d    y     4     1
18    b    d    y     4     1

Get the value of a column where one column is max and another is min

Answers (1)

Related Questions