Shivam Singhal

Reputation: 23

Speed Up Pandas Iterations

I have a DataFrame which consists of 3 columns: CustomerID, Amount, and Status (Success or Failed). The DataFrame is not sorted in any way. A CustomerID can repeat multiple times in the DataFrame.

I want to introduce new columns into this DataFrame with below logic:

df['totalamount'] = sum of Amount for each customer where Status was Success.

I already have working code using df.iterrows, but it takes too much time. Could you suggest alternative methods such as pandas or NumPy vectorization?

For example, I want to create the 'totalamount' column from the first three columns:

   CustomerID  Amount   Status  totalamount
0           1       5  Success          105 # since both transactions were successful
1           2      10   Failed           80 # since one transaction was successful
2           3      50  Success           50
3           1     100  Success          105
4           2      80  Success           80
5           4      60   Failed            0
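
For a reproducible example, here is the frame above along with a simplified sketch of the kind of iterrows loop I am trying to replace (my actual code differs, but it is row-by-row like this):

import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 1, 2, 4],
    'Amount': [5, 10, 50, 100, 80, 60],
    'Status': ['Success', 'Failed', 'Success', 'Success', 'Success', 'Failed'],
})

# Slow row-by-row version: for every row, re-scan the whole frame to total
# that customer's successful amounts (O(n^2) overall, hence the slowness)
totals = []
for _, row in df.iterrows():
    mask = (df['CustomerID'] == row['CustomerID']) & (df['Status'] == 'Success')
    totals.append(df.loc[mask, 'Amount'].sum())
df['totalamount'] = totals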

Upvotes: 2

Views: 50

Answers (2)

Adept

Reputation: 554

# Sum successful amounts per customer, then merge the totals back onto every row
df_new = df.groupby(['CustomerID', 'Status'], sort=False)['Amount'].sum().reset_index()
df_new = (df_new[df_new['Status'] == 'Success']
            .drop(columns='Status')
            .rename(columns={'Amount': 'totalamount'}))
df = pd.merge(df, df_new, on=['CustomerID'], how='left')
# Customers with no successful rows (e.g. CustomerID 4) come out of the
# left merge as NaN, so fill with 0 to match the expected output
df['totalamount'] = df['totalamount'].fillna(0)

I'm not sure at all, but I think this may work.

Upvotes: 1

ALollz

Reputation: 59549

Use where to mask the 'Failed' rows with NaN while preserving the length of the DataFrame. Then groupby the CustomerID and transform the sum of the 'Amount' column to bring the result back to every row.

df['totalamount'] = (df.where(df['Status'].eq('Success'))
                       .groupby(df['CustomerID'])['Amount']
                       .transform('sum'))

   CustomerID  Amount   Status  totalamount
0           1       5  Success        105.0
1           2      10   Failed         80.0
2           3      50  Success         50.0
3           1     100  Success        105.0
4           2      80  Success         80.0
5           4      60   Failed          0.0

The reason for using where (as opposed to subsetting the DataFrame) is that groupby + sum by default sums an entirely NaN group to 0, so we don't need anything extra to deal with CustomerID 4, for instance.
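
As a quick illustration of that default (a minimal sketch): summing an all-NaN Series returns 0 unless min_count requires at least one valid value:

import pandas as pd
import numpy as np

# After the mask, CustomerID 4's group is entirely NaN
s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 (default min_count=0)
print(s.sum(min_count=1))  # nan (requires at least one non-NaN value)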

Upvotes: 5
