Hossein

Reputation: 106

Pandas: Conditional cumsum based on previous row value of another column

I want to compute a cumulative sum of one column, but on each row the running sum must be compared with the value of another column: if the other column's value is smaller, that value is used instead and the sum continues from it on the next row. Consider this dataframe:

import pandas as pd
df = pd.DataFrame({'X': [0,1,0,1,1,0,0,0,0,1,1,1,0,1], 'Y': [0, 0, 1, 1, 1, 2, 3, 4, 4, 4, 4, 4, 5, 5]})
    X   Y
0   0   0
1   1   0
2   0   1
3   1   1
4   1   1
5   0   2
6   0   3
7   0   4
8   0   4
9   1   4
10  1   4
11  1   4
12  0   5
13  1   5

Now, I want to cumsum X, but whenever the running sum on a row is larger than Y on that row, Y is used instead of the sum. For example, the cumsum result for the second row of X is 1, but because the 0 in the second row of Y is less than 1, we replace it with 0. The cumsum at the third row then continues from 0, so it is 0 instead of 1. I wrote the code using a for loop as follows, but it is not efficient at all for large datasets:

df['Z'] = 0  # row 0 stays at 0
for index in range(1, len(df)):
    # previous Z plus the current X, capped at the current Y
    df.loc[index, 'Z'] = min(df.loc[index, 'X'] + df.loc[index - 1, 'Z'], df.loc[index, 'Y'])

The expected result for Z would be:

    X   Y   Z
0   0   0   0
1   1   0   0
2   0   1   0
3   1   1   1
4   1   1   1
5   0   2   1
6   0   3   1
7   0   4   1
8   0   4   1
9   1   4   2
10  1   4   3
11  1   4   4
12  0   5   4
13  1   5   5

I would appreciate it if someone could suggest a more efficient way.

Upvotes: 1

Views: 1449

Answers (1)

Valdi_Bo

Reputation: 30971

As the result of your "specialized cumsum" on each row depends on the result for the previous row, you can't use the actual cumsum function.

Instead you should use a function "with memory", i.e. one that remembers the value it returned last and uses it in the next invocation (for the next row).
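To illustrate why the built-in cumsum does not fit, here is a minimal sketch that takes the ordinary cumsum of X and then caps it by Y in a single vectorized step. The name naive is only illustrative. Because the cap is applied after the whole sum has been accumulated, a capped value never feeds into the following rows, so the output differs from the expected Z from row 2 onwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1],
                   'Y': [0, 0, 1, 1, 1, 2, 3, 4, 4, 4, 4, 4, 5, 5]})

# Plain cumsum of X, capped element-wise by Y afterwards.
# The cap does not propagate to later rows, so this is NOT the wanted result.
naive = np.minimum(df['X'].cumsum(), df['Y'])

expected = [0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5]
print(naive.tolist())              # [0, 0, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5]
print(naive.tolist() == expected)  # False
```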

Row 0 is a special case here. There is no previous value of the Z column for row 0, and your own code leaves the result for the first row as 0, so I did the same in my function (see below).

Other rows are computed according to your algorithm.

To compute your "specialized cumsum", define the following function:

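def myCumSum(row):
    # The previous result is kept as an attribute of the function itself
    if row.name == 0:
        myCumSum.prev = 0  # first row: start from 0
    else:
        # previous result plus the current X, capped at the current Y
        myCumSum.prev = min(row.X + myCumSum.prev, row.Y)
    return myCumSum.prev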

A bit of explanation concerning row.name: It is actually the index of the current row and my solution relies on the fact that the source DataFrame has the default index, i.e. consecutive numbers starting from 0.

Then apply it to each row and save the result in a new column (Z):

df['Z'] = df.apply(myCumSum, axis=1)

The result is:

    X  Y  Z
0   0  0  0
1   1  0  0
2   0  1  0
3   1  1  1
4   1  1  1
5   0  2  1
6   0  3  1
7   0  4  1
8   0  4  1
9   1  4  2
10  1  4  3
11  1  4  4
12  0  5  4
13  1  5  5
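If the DataFrame does not have the default index, or if apply is still too slow, one possible variant is to run the same recurrence over plain NumPy arrays. This is only a sketch under a few assumptions: the function name capped_cumsum and the letter index are made up for illustration, and row 0 is fixed at 0 exactly as in the loop from the question. It is still a Python-level loop, but it avoids building a Series for every row and does not look at the index labels at all.

```python
import numpy as np
import pandas as pd

def capped_cumsum(x, y):
    """Running sum of x, capped at y on each row (hypothetical helper).

    Row 0 is left at 0, following the loop in the question.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    z = np.zeros(len(x), dtype=np.result_type(x, y))
    for i in range(1, len(x)):
        # previous result plus the current x, capped at the current y
        z[i] = min(z[i - 1] + x[i], y[i])
    return z

# A non-default index, to show that the labels play no role here
df = pd.DataFrame({'X': [0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1],
                   'Y': [0, 0, 1, 1, 1, 2, 3, 4, 4, 4, 4, 4, 5, 5]},
                  index=list('abcdefghijklmn'))

df['Z'] = capped_cumsum(df['X'].to_numpy(), df['Y'].to_numpy())
print(df['Z'].tolist())  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5]
```

The returned array is assigned by position, so the Z column lines up with the rows whatever the index looks like.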

Upvotes: 1
