Jihong
Jihong

Reputation: 163

Resetting Cumulative Sum once a value is reached and set a flag to 1

I'm having trouble coming up with a way to perform a cumulative sum on a column and creating a flag once it hits a certain value.

So given a dataframe:

df = pd.DataFrame([[5,1],[6,1],[30,1],[170,0],[5,1],[10,1]],columns = ['a','b'])

     a  b
0    5  1
1    6  1
2   30  1
3  170  0
4    5  1
5   10  1

For column A, I want to perform the cumulative sum and set the "Flag" column value to 1 if that max value is reached. Upon reaching that max value, it will reset to 0. In this case, the max value is 40. Any cumulative sum over 40 will trigger the reset

Desired Output

     a  b  Flag
0    5  1     0
1   11  1     0
2   41  1     1
3  170  0     1
4    5  1     0
5   15  1     0

Any help would be appreciated!

Upvotes: 4

Views: 1880

Answers (3)

ombk
ombk

Reputation: 2111

df = pd.DataFrame([[5,1],[6,1],[30,1],[170,0],[5,1],[10,1], [25,1],[42,1],[41,5],[5,1]],columns = ['a','b'])
def cumsum_reset(df, sum_col="a"):
    flag_collector = []
    sum_collector = []
    c_reset_value = df.index.min()
    for i in df.index:
        sum_current_range = sum(df.loc[c_reset_value:i,sum_col])
        if sum_current_range >= 40:
            flag_collector.append(True)
            c_reset_value = i + 1
        else:
            flag_collector.append(False)
        sum_collector.append(sum_current_range)
    df["Flag"] = flag_collector
    df["sum_cum"] = sum_collector
    return df
     a  b   Flag  sum_cum
0    5  1  False        5
1    6  1  False       11
2   30  1   True       41
3  170  0   True      170
4    5  1  False        5
5   10  1  False       15
6   25  1   True       40
7   42  1   True       42
8   41  5   True       41
9    5  1  False        5

Dumbest way to do it. Added some extra values for extra checks.

Upvotes: 0

KingOtto
KingOtto

Reputation: 1483

There is a very straightforward, vectorized answer to this.

You are asking for each row where the integer division of the cumsum by 40 of up to its own value of a is not the same as that same division result of the preceding row. In other words: If, from one row to the next, the integer division of cusum by 40 is different, then set the flag to true.

integer_division = df['a'].cumsum()//40
df['flag'] = (integer_division != integer_division.shift(1).fillna(0)).astype(int)

We proceed in two steps:

  1. calculate the cumsum, and the integer division result, save in a temporary series integer_division
  2. make the "row to row" comparison by shifting that said series by +1

I disagree with the other comments and answers: "Ordinary cumsum" is just perfect - the restart happens wherever the integer division result changes - that's where you want the restart. Your 'threshold' is "any change in result of the //40 operation". Indeed the outcome is what we were looking for:

print(df)
     a  b  flag
0    5  1     0
1    6  1     0
2   30  1     1
3  170  0     1
4    5  1     0
5   10  1     0

A final comment: From the question it seems to me that there is a bit of interpretation room whether - when restarting - whether to reset the cumsum to the "cumsum value" or the "df maximum" that was reached.. The answer here restarts at the cumsum value. Regardless, also if restart at the "df value" is desired, it can be vectorized by buffering the non-integer division part (np.mod), and subtracting it from the actual cumsum value. Also that adjustment does not require a "running memory".

Upvotes: 3

Valdi_Bo
Valdi_Bo

Reputation: 30971

"Ordinary" cumsum() is here useless, as this function "doesn't know" where to restart summation.

You can do it with the following custom function:

def myCumSum(x, thr):
    if myCumSum.prev >= thr:
        myCumSum.prev = 0
    myCumSum.prev += x
    return myCumSum.prev

This function is "with memory" (from the previous call) - prev, so there is a way to "know" where to restart.

To speed up the execution, define a vectorized version of this function:

myCumSumV = np.vectorize(myCumSum, otypes=[np.int], excluded=['thr'])

Then execute:

threshold = 40
myCumSum.prev = 0  # Set the "previous" value
# Replace "a" column with your cumulative sum
df.a = myCumSumV(df.a.values, threshold)
df['flag'] = df.a.ge(threshold).astype(int)  # Compute "flag" column

The result is:

     a  b  flag
0    5  1     0
1   11  1     0
2   41  1     1
3  170  0     1
4    5  1     0
5   15  1     0

Upvotes: 4

Related Questions