Reputation: 151
I want to add the particular columns (C, D, E, F, G) based on the duplicate rows of column B. Whereas the remaining non-duplicate rows unchanged. The output of column A must be the first index of duplicate rows.
I have a dataframe as follows:
A B C D E F G
box1 0487 1 1 1
box2 0487 1 1
blue 0478 1 1 1
gray 0478 1 1 1 1
gray 0478 1 1 1
flat 8704 1 1 1
clay 8704 1 1
dark 8740 1 1 1 1 1
late 4087 1 1 1
I want the output as follows:
A B C D E F G
box1 0487 1 1 1 1 1
blue 0478 2 2 2 2 2
flat 8704 1 1 1 2
dark 8740 1 1 1 1 1
late 4087 1 1 1
I am pleased to hear some suggestions.
Upvotes: 1
Views: 184
Reputation: 862641
Create dictionary of columns names with aggregate functions and pass to agg
, also here is necessary min_count=1
to sum
for avoid 0
for sum NaN
s values:
L = ['C','D','E','F','G']
d = {**dict.fromkeys(L, lambda x: x.sum(min_count=1)), **{'A':'first'}}
df = df.groupby('B', as_index=False, sort=False).agg(d).reindex(columns=df.columns)
print (df)
A B C D E F G
0 box1 0487 1.0 1.0 1.0 1.0 1.0
1 blue 0478 2.0 2.0 2.0 2.0 2.0
2 flat 8704 1.0 1.0 1.0 NaN 2.0
3 dark 8740 1.0 1.0 1.0 1.0 1.0
4 late 4087 1.0 NaN 1.0 NaN 1.0
d = {**dict.fromkeys(L, 'sum'), **{'A':'first'}}
df = df.groupby('B', as_index=False, sort=False).agg(d).reindex(columns=df.columns)
print (df)
A B C D E F G
0 box1 0487 1.0 1.0 1.0 1.0 1.0
1 blue 0478 2.0 2.0 2.0 2.0 2.0
2 flat 8704 1.0 1.0 1.0 0.0 2.0
3 dark 8740 1.0 1.0 1.0 1.0 1.0
4 late 4087 1.0 0.0 1.0 0.0 1.0
Upvotes: 7