Reputation: 9482
Apologies if I have overlooked a similar problem that was solved elsewhere. These are the posts that I looked into before asking this question:
Pandas DataFrame groupby based on condition
pandas groupby replace based on condition
Python pandas dataframe group by based on a condition
Given a dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [1,2,2,3,3,4,5,5,6,6],
    'b': np.random.rand(10),
    'c': 10*[0],
})
that has a column a that holds identities, and a column b that has random values, I want to relabel values in column c.
If there is only one entry for a, no change should be applied.
If there are two entries for a, I want to relabel the entry in c at which b is minimal:
a b c
0 1 0.472015 0 # <-- only one entry for a => nothing changes
1 2 0.553018 2 # <-- b is minimal => c gets value 2
2 2 0.770302 0
3 3 0.992023 0
4 3 0.119448 2 # <-- b is minimal => c gets value 2
Combination of groupby, agg and .loc:
g = df.groupby('a').agg({'b': [np.argmin, np.argmax]})
df.loc[g[g[('b', 'argmin')] != g[('b', 'argmax')]][('b', 'argmin')], 'c' ] = 2
This works, but seems quite roundabout.
Is there a less clumsy way to go about this?
Help is greatly appreciated!
Groups have either one or two members. If both members have the same b value, nothing should happen.
df = pd.DataFrame({
    'a': [1,2,2,3,3,4,5,5,6,6],
    'b': [1,1,2,1,2,1,1,1,2,1],
    'c': 10*[0],
})
df.loc[df['b'].eq(df.groupby('a')['b'].transform('min')),'c']=2
out:
a b c
0 1 1 2
1 2 1 2
2 2 2 0
3 3 1 2
4 3 2 0
5 4 1 2
6 5 1 2
7 5 1 2
8 6 2 0
9 6 1 2
df.loc[df.index==df.groupby('a')['b'].transform('idxmin'),'c']=2
a b c
0 1 1 2 # <-- a has only one member, so this shouldn't be changed
1 2 1 2
2 2 2 0
3 3 1 2
4 3 2 0
5 4 1 2
6 5 1 2
7 5 1 0
8 6 2 0
9 6 1 2
Upvotes: 1
Views: 1814
Reputation: 75080
You can use groupby+transform with df.loc[]:
df.loc[df['b'].eq(df.groupby('a')['b'].transform('min')),'c']=2
Example:
df.loc[df['b'].eq(df.groupby('a')['b'].transform('min')),'c']=2
print(df)
a b c
0 1 0.126956 2
1 2 0.249078 2
2 2 0.929619 0
3 3 0.013735 2
4 3 0.192781 0
5 4 0.268828 2
6 5 0.649238 2
7 5 0.767545 0
8 6 0.678478 2
9 6 0.815916 0
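To see why this one-liner also relabels groups with a single entry, it may help to look at the intermediate result of transform. The following is a minimal sketch on a small hand-written frame (the b values are made up for illustration, not taken from the random data above):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 2, 3, 3],
    'b': [0.47, 0.55, 0.77, 0.99, 0.12],
    'c': 5 * [0],
})

# transform('min') returns one value per original row:
# the minimum of b within that row's group
group_min = df.groupby('a')['b'].transform('min')

# True where a row holds its group's minimum; a group with a single row
# is trivially its own minimum, so it is marked as well
is_min = df['b'].eq(group_min)
df.loc[is_min, 'c'] = 2
print(df)
```

Here row 0 (the only entry for a == 1) ends up with c == 2, which is the behaviour the question wants to avoid.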
EDIT:
g = df.groupby('a')
# keep rows that are their group's idxmin AND whose group has more than one member
cond = pd.Series(df.index==g['b'].transform('idxmin')) & g['a'].transform('count').gt(1)
df.loc[cond,'c'] = 2
print(df)
a b c
0 1 1 0
1 2 1 2
2 2 2 0
3 3 1 2
4 3 2 0
5 4 1 0
6 5 1 2
7 5 1 0
8 6 2 0
9 6 1 2
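Note that in the output above, row 6 is still relabelled although both entries for a == 5 have the same b value. If the rule from the question ("if both members have the same b value, nothing should happen") needs to hold as well, one possible variant is sketched below. It assumes, as stated in the question, that groups have at most two members; the names is_min and has_spread are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 2, 3, 3, 4, 5, 5, 6, 6],
    'b': [1, 1, 2, 1, 2, 1, 1, 1, 2, 1],
    'c': 10 * [0],
})

g = df.groupby('a')['b']

# rows that hold their group's minimum
is_min = df['b'].eq(g.transform('min'))

# groups whose b values are not all equal; this excludes
# single-row groups and tied groups in one step
has_spread = g.transform('nunique').gt(1)

df.loc[is_min & has_spread, 'c'] = 2
print(df)
```

With this frame only rows 1, 3 and 9 get c == 2. For groups with more than two members, several rows could share the minimum and all of them would be marked, so the condition would need adjusting in that case.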
Upvotes: 1