Reputation: 23
I have a DataFrame consisting of 3 columns: CustomerId, Amount and Status (success or failed). The DataFrame is not sorted in any way, and a CustomerId can repeat multiple times.
I want to introduce a new column into this DataFrame with the below logic:
df['totalamount'] = sum of Amount for each customer where Status was success.
I already have working code, but it uses df.iterrows and takes too much time. Could you kindly suggest alternative methods such as pandas or NumPy vectorization?
For Example, I want to create the 'totalamount' column from the first three columns:
CustomerID Amount Status totalamount
0 1 5 Success 105 # since both transactions were successful
1 2 10 Failed 80 # since one transaction was successful
2 3 50 Success 50
3 1 100 Success 105
4 2 80 Success 80
5 4 60 Failed 0
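For reference, the example frame above can be constructed like this (a minimal sketch using the data from the table):

```python
import pandas as pd

# The three input columns from the question; 'totalamount' is what we want to compute.
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 1, 2, 4],
    'Amount': [5, 10, 50, 100, 80, 60],
    'Status': ['Success', 'Failed', 'Success', 'Success', 'Success', 'Failed'],
})
```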
Upvotes: 2
Views: 50
Reputation: 554
df_new = df.groupby(['CustomerID', 'Status'], sort=False)['Amount'].sum().reset_index()
df_new = (df_new[df_new['Status'] == 'Success']
.drop(columns='Status')
.rename(columns={'Amount': 'totalamount'}))
df = pd.merge(df, df_new, on=['CustomerID'], how='left')
I'm not sure at all but I think this may work
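A runnable sketch of this merge-based approach. Note one assumption beyond the answer as written: the left merge leaves NaN for customers with no successful rows (CustomerID 4), so a fillna(0) is added at the end to match the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 1, 2, 4],
    'Amount': [5, 10, 50, 100, 80, 60],
    'Status': ['Success', 'Failed', 'Success', 'Success', 'Success', 'Failed'],
})

# Sum Amount per (CustomerID, Status), then keep only the successful totals.
df_new = df.groupby(['CustomerID', 'Status'], sort=False)['Amount'].sum().reset_index()
df_new = (df_new[df_new['Status'] == 'Success']
          .drop(columns='Status')
          .rename(columns={'Amount': 'totalamount'}))

# Left-merge back onto the original rows; customers with no 'Success'
# rows come back as NaN, so fill those with 0.
df = pd.merge(df, df_new, on='CustomerID', how='left')
df['totalamount'] = df['totalamount'].fillna(0)
```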
Upvotes: 1
Reputation: 59549
Use where to mask the 'Failed' rows with NaN while preserving the length of the DataFrame. Then groupby the CustomerID and transform the sum of the 'Amount' column to bring the result back to every row.
df['totalamount'] = (df.where(df['Status'].eq('Success'))
.groupby(df['CustomerID'])['Amount']
.transform('sum'))
CustomerID Amount Status totalamount
0 1 5 Success 105.0
1 2 10 Failed 80.0
2 3 50 Success 50.0
3 1 100 Success 105.0
4 2 80 Success 80.0
5 4 60 Failed 0.0
The reason for using where (as opposed to subsetting the DataFrame) is that groupby + sum by default sums an entirely-NaN group to 0, so we don't need anything extra to deal with CustomerID 4, for instance.
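A self-contained sketch demonstrating that behavior: CustomerID 4 becomes an all-NaN group after the where mask, and its transformed sum comes out as 0 with no extra handling.

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 1, 2, 4],
    'Amount': [5, 10, 50, 100, 80, 60],
    'Status': ['Success', 'Failed', 'Success', 'Success', 'Success', 'Failed'],
})

# Mask 'Failed' rows with NaN (length preserved), then broadcast the
# per-customer sum back to every row with transform.
df['totalamount'] = (df.where(df['Status'].eq('Success'))
                       .groupby(df['CustomerID'])['Amount']
                       .transform('sum'))

# CustomerID 4 has only failed rows -> an all-NaN group, which sums to 0.0.
```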
Upvotes: 5