quant
quant

Reputation: 4482

how to keep the value of a column that has the highest value on another column with groupby in pandas

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'var': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
                       'value': [1, 2, 1, 2, 3, 4, 5],
                       'input': [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.3]})

I would like to keep the var for which the value is the highest by input and set the rest of the var to NA.

So I would like to end up with:

   df = pd.DataFrame({'var': [np.nan, 'A', np.nan, 'B', np.nan, np.nan, 'C'],
                       'value': [1, 2, 1, 2, 3, 4, 5],
                       'input': [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.3]})

Any ideas ?

Upvotes: 1

Views: 39

Answers (2)

Mark Wang
Mark Wang

Reputation: 2757

where is a useful trick to set NaNs in a column based on some conditions.

df['var'].where(df.groupby('var').value.transform('max').eq(df['value']))

Output

0   NaN
1   A
2   NaN
3   B
4   NaN
5   NaN
6   C

Upvotes: 1

jezrael
jezrael

Reputation: 862901

Use GroupBy.transform with max for Series with same size like original DataFrame, compare for not equal by Series.ne and set new values with loc:

mask = df.groupby('var')['value'].transform('max').ne(df['value'])

df.loc[mask, 'var'] = np.nan
print (df)
   var  value  input
0  NaN      1    0.1
1    A      2    0.2
2  NaN      1    0.3
3    B      2    0.4
4  NaN      3    0.5
5  NaN      4    0.6
6    C      5    0.7

Upvotes: 3

Related Questions