Reputation: 1296
I have two separate problems where I want to set values in a dataframe column based on values in other columns of that row or other rows that are related (ie. other rows that have some column value in common with said rows column value).
Starting DF
acct no product type amount product sum product id
0 100 1 35 0.0 NaN
1 100 2 45 0.0 NaN
2 100 2 50 0.0 NaN
3 101 4 10 0.0 NaN
4 101 4 1 0.0 NaN
5 102 5 70 0.0 NaN
6 102 6 90 0.0 NaN
In one problem (for work), I would like to group by (in this case on two columns), take the sum of values listed in a third column, then write this sum value to a here-to-for unset column on all of the rows that were grouped-by
ex (account-product group by and sum as product sum):
acct no product type amount product sum product id
0 100 1 35 35.0 NaN
1 100 2 45 95.0 NaN
2 100 2 50 95.0 NaN
3 101 4 10 11.0 NaN
4 101 4 1 11.0 NaN
5 102 5 70 70.0 NaN
6 102 6 90 90.0 NaN
In another problem (for 'fun'), I would like to create a 'product_id' that is based on another column value 'account number' and follows the convention account_number_i where i is the ith product for that account
ex (iterate through account groups and write to product id as account iter):
acct no product type amount product sum product id
0 100 1 35 0.0 100-1
1 100 2 45 0.0 100-2
2 100 2 50 0.0 100-3
3 101 4 10 0.0 101-1
4 101 4 1 0.0 101-2
5 102 5 70 0.0 102-1
6 102 6 90 0.0 102-2
In trying to do both of these, I have started by iterating through a sorted dataframe and utilizing functions like group by and sum, but it is about as slow as you would imagine. I am not nearly as comfortable with the pandas syntax as with python and one reason I am asking is to get a better grip on the best way to do what seems like a pretty common/ important task, updating empty rows based on related rows. I have used pandas for running averages and know this is fast, but what I want is a little more esoteric.
Upvotes: 0
Views: 206
Reputation: 22503
Don't iterate through your dataframe - 99% of the times there are better options.
Your problems can be easily solved by using groupby
with transform
, and cumcount
:
df["product sum"] = df.groupby(["acct no","product type"])["amount"].transform('sum')
df["product id"] = df["acct no"].astype(str)+"-"+(df.groupby("acct no").cumcount()+1).astype(str)
print (df)
acct no product type amount product sum product id
0 100 1 35 35 100-1
1 100 2 45 95 100-2
2 100 2 50 95 100-3
3 101 4 10 11 101-1
4 101 4 1 11 101-2
5 102 5 70 70 102-1
6 102 6 90 90 102-2
Upvotes: 3