Efficiently updating dataframe rows on the fly

Question

I have two separate problems where I want to set values in a dataframe column based on values in other columns of that row or other rows that are related (ie. other rows that have some column value in common with said rows column value).

Starting DF

   acct no  product type  amount  product sum  product id
0      100             1      35          0.0         NaN
1      100             2      45          0.0         NaN
2      100             2      50          0.0         NaN
3      101             4      10          0.0         NaN
4      101             4       1          0.0         NaN
5      102             5      70          0.0         NaN
6      102             6      90          0.0         NaN

In one problem (for work), I would like to group by (in this case on two columns), take the sum of values listed in a third column, then write this sum value to a here-to-for unset column on all of the rows that were grouped-by

ex (account-product group by and sum as product sum):

   acct no  product type  amount  product sum  product id
0      100             1      35          35.0         NaN
1      100             2      45          95.0         NaN
2      100             2      50          95.0         NaN
3      101             4      10          11.0         NaN
4      101             4       1          11.0         NaN
5      102             5      70          70.0         NaN
6      102             6      90          90.0         NaN

In another problem (for 'fun'), I would like to create a 'product_id' that is based on another column value 'account number' and follows the convention account_number_i where i is the ith product for that account

ex (iterate through account groups and write to product id as account iter):

   acct no  product type  amount  product sum  product id
0      100             1      35          0.0         100-1
1      100             2      45          0.0         100-2
2      100             2      50          0.0         100-3
3      101             4      10          0.0         101-1
4      101             4       1          0.0         101-2
5      102             5      70          0.0         102-1
6      102             6      90          0.0         102-2

In trying to do both of these, I have started by iterating through a sorted dataframe and utilizing functions like group by and sum, but it is about as slow as you would imagine. I am not nearly as comfortable with the pandas syntax as with python and one reason I am asking is to get a better grip on the best way to do what seems like a pretty common/ important task, updating empty rows based on related rows. I have used pandas for running averages and know this is fast, but what I want is a little more esoteric.

Henry Yik · Accepted Answer

Don't iterate through your dataframe - 99% of the times there are better options.

Your problems can be easily solved by using groupby with transform, and cumcount:

df["product sum"] = df.groupby(["acct no","product type"])["amount"].transform('sum')
df["product id"] = df["acct no"].astype(str)+"-"+(df.groupby("acct no").cumcount()+1).astype(str)

print (df)

   acct no  product type  amount  product sum product id
0      100             1      35           35      100-1
1      100             2      45           95      100-2
2      100             2      50           95      100-3
3      101             4      10           11      101-1
4      101             4       1           11      101-2
5      102             5      70           70      102-1
6      102             6      90           90      102-2

Efficiently updating dataframe rows on the fly

Answers (1)

Related Questions