Reputation: 468
I am trying to do something which should be simple... I have a dataframe and I am trying to group by a specific column, the output dataframe needs to keep one column and sum two others columns. Below is an example:
df:
User Col1ToSum Col2ToSum ColToKeep
ABC 10 440 1.015
ABC 15 410 1.015
ABC 15 -200 1.015
ABA 100 110 2.24
ABA 80 -10 2.24
AAA 40 10 nan
AAA 20 10 nan
BBB 10 15 nan
XYZ 10 10 1.1
XYZ 10 10 1.5 (note the 'ColToKeep' is different for user XYZ)
Expected Output:
User Col1ToSum Col2ToSum ColToKeep
ABC 45 650 1.015
ABA 180 100 2.24
AAA 60 20 nan
BBB 10 15 nan
XYZ 10 10 1.1
XYZ 10 10 1.5
I tried to do:
df.groupby(['User', 'ColToKeep'], as_index=False).mean()
Unfortunately, it is removing any records where ColToKeep is nan and the two columns to sum are not sum.df.groupby(['User'], as_index=False)['Col1ToSum', 'Col2ToSum'].sum()
Unfortunately, it is removing 'ColToKeep' column in the output dfdf.groupby(['User', 'ColToKeep'], as_index=False)['Col1ToSum', 'Col2ToSum'].sum()
but same as point 1) it is removing records where ColToKeep is nanThe reason why I tried to group by User and ColToKeep is because I want to have 2 records if User is the same BUT ColToKeep is different.
I am not sure why the group by would remove nan values (maybe due to my lack of experience with panda...)
Would you be able to please help on the above or even explain why it is happening ?
Upvotes: 0
Views: 882
Reputation: 851
Change "ColToKeep" to a string, then use your code.
In : df['ColToKeep'] = df['ColToKeep'].astype(str)
In : df.groupby(['User','ColToKeep'], as_index=False).sum()
Out:
User ColToKeep Col1ToSum Col2ToSum
0 AAA nan 60 20
1 ABA 2.24 180 100
2 ABC 1.015 40 650
3 BBB nan 10 15
4 XYZ 1.1 10 10
5 XYZ 1.5 10 10
You can change the ColToKeep back to a float if necessary by:
df1 = df.groupby(['User','ColToKeep'], as_index=False).sum()
df1['ColToKeep'] = df1['ColToKeep'].astype(float)
print(df1)
User ColToKeep Col1ToSum Col2ToSum
0 AAA NaN 60 20
1 ABA 2.240 180 100
2 ABC 1.015 40 650
3 BBB NaN 10 15
4 XYZ 1.100 10 10
5 XYZ 1.500 10 10
Upvotes: 1
Reputation: 862406
Pandas support missing values in groupby
from 1.1
version, link.
First idea is create new helper column new
with replace missing values to some string, e.g. miss
, then grouping by new
with aggregate by GroupBy.agg
with GroupBy.first
, last remove helper level by first reset_index
:
df = (df.assign(new= df['ColToKeep'].fillna('miss'))
.groupby(['User', 'new'], sort=False)
.agg({'Col1ToSum':'sum', 'Col2ToSum':'sum', 'ColToKeep':'first'})
.reset_index(level=1, drop=True)
.reset_index())
print (df)
User Col1ToSum Col2ToSum ColToKeep
0 ABC 40 650 1.015
1 ABA 180 100 2.240
2 AAA 60 20 NaN
3 BBB 10 15 NaN
4 XYZ 10 10 1.100
5 XYZ 10 10 1.500
Another idea is replace back miss
to NaN
s:
df = (df.assign(ColToKeep = df['ColToKeep'].fillna('miss'))
.groupby(['User', 'ColToKeep'], sort=False)[['Col1ToSum', 'Col2ToSum']].sum()
.reset_index()
.replace({'ColToKeep': {'miss':np.nan}}))
print (df)
User ColToKeep Col1ToSum Col2ToSum
0 ABC 1.015 40 650
1 ABA 2.240 180 100
2 AAA NaN 60 20
3 BBB NaN 10 15
4 XYZ 1.100 10 10
5 XYZ 1.500 10 10
Upvotes: 3