Panda Group by sum specific columns and keep other columns

Question

I am trying to do something which should be simple... I have a dataframe and I am trying to group by a specific column, the output dataframe needs to keep one column and sum two others columns. Below is an example:

df:

User  Col1ToSum  Col2ToSum   ColToKeep
ABC     10         440         1.015
ABC     15         410         1.015
ABC     15        -200         1.015
ABA     100        110         2.24 
ABA     80        -10          2.24
AAA     40         10          nan
AAA     20         10          nan
BBB     10         15          nan
XYZ     10         10          1.1
XYZ     10         10          1.5 (note the 'ColToKeep' is different for user XYZ)

Expected Output:

User  Col1ToSum  Col2ToSum   ColToKeep
ABC     45         650         1.015
ABA     180        100         2.24
AAA     60         20          nan
BBB     10         15          nan
XYZ     10         10          1.1
XYZ     10         10          1.5

I tried to do:

df.groupby(['User', 'ColToKeep'], as_index=False).mean() Unfortunately, it is removing any records where ColToKeep is nan and the two columns to sum are not sum.
df.groupby(['User'], as_index=False)['Col1ToSum', 'Col2ToSum'].sum() Unfortunately, it is removing 'ColToKeep' column in the output df
df.groupby(['User', 'ColToKeep'], as_index=False)['Col1ToSum', 'Col2ToSum'].sum() but same as point 1) it is removing records where ColToKeep is nan

The reason why I tried to group by User and ColToKeep is because I want to have 2 records if User is the same BUT ColToKeep is different.

I am not sure why the group by would remove nan values (maybe due to my lack of experience with panda...)

Would you be able to please help on the above or even explain why it is happening ?

jezrael · Accepted Answer

Pandas support missing values in groupby from 1.1 version, link.

First idea is create new helper column new with replace missing values to some string, e.g. miss, then grouping by new with aggregate by GroupBy.agg with GroupBy.first, last remove helper level by first reset_index:

df = (df.assign(new= df['ColToKeep'].fillna('miss'))
       .groupby(['User', 'new'], sort=False)
       .agg({'Col1ToSum':'sum', 'Col2ToSum':'sum', 'ColToKeep':'first'})
       .reset_index(level=1, drop=True)
       .reset_index())
print (df)
  User  Col1ToSum  Col2ToSum  ColToKeep
0  ABC         40        650      1.015
1  ABA        180        100      2.240
2  AAA         60         20        NaN
3  BBB         10         15        NaN
4  XYZ         10         10      1.100
5  XYZ         10         10      1.500

Another idea is replace back miss to NaNs:

df = (df.assign(ColToKeep = df['ColToKeep'].fillna('miss'))
       .groupby(['User', 'ColToKeep'], sort=False)[['Col1ToSum', 'Col2ToSum']].sum()
       .reset_index()
       .replace({'ColToKeep': {'miss':np.nan}}))
print (df)
  User  ColToKeep  Col1ToSum  Col2ToSum
0  ABC      1.015         40        650
1  ABA      2.240        180        100
2  AAA        NaN         60         20
3  BBB        NaN         10         15
4  XYZ      1.100         10         10
5  XYZ      1.500         10         10

Panda Group by sum specific columns and keep other columns

Answers (2)

Related Questions