Reputation: 311
I have the following df:
A B C
entry1 10 3
entry1 7 5
entry1 2 NaN
entry2 1 1
entry2 2 4
entry2 3 3
...
In column A I have duplicated entries and I am interested to pick the highest values from column B and C for each entry. In case that NaN
value exists, this overwrites the highest values.
Example:
For entry1, I would like to keep in the df just value 10
in column B and NaN
in column C.
For entry2, I would like to keep in the df just value 3
in column B and 4
in column C.
I was thinking to sort the values and then to pick the last value like this:
df = df.sort_values(by=['A', 'B', 'C'], ascending=[False, True, True])
print(df.drop_duplicates(subset='A', keep='last'))
In this way, the remaining entry will have just the highest value for column B, not also for column C.
How I can achieve that as well?
Thank you.
Upvotes: 2
Views: 815
Reputation: 35626
One approach would be to fillna
the nan
with inf
then just take the groupby max
to get the max value per group. Then replace
the inf
back to nan
. This ensures that a nan
value will be maximal per group since inf
is:
cols = ['B', 'C']
df[cols] = df[cols].fillna(np.inf) # replace NaN with largest value
df = df.groupby('A', as_index=False)[cols].max() # Get Max Per Group
df[cols] = df[cols].replace(np.inf, np.nan) # Return to NaN
Alternatively use a function or lambda to produce nan
if any
values are NaN otherwise the max
:
df = (
df.groupby('A', as_index=False)[['B', 'C']]
.agg(lambda s: np.nan if s.isna().any() else s.max())
)
Both Produce df
:
A B C
0 entry1 10 NaN
1 entry2 3 4.0
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A': ['entry1', 'entry1', 'entry1', 'entry2', 'entry2', 'entry2'],
'B': [10, 7, 2, 1, 2, 3],
'C': [3.0, 5.0, np.nan, 1.0, 4.0, 3.0]
})
Upvotes: 3