Thanh Nguyen
Thanh Nguyen

Reputation: 912

Pandas cumulative sum reset based on percentage of the last checkpoint

Sample Data

date_rng = pd.date_range('2019-01-01', freq='s', periods=400)
df = pd.DataFrame(np.random.lognormal(.005, .5, size=(len(date_rng))),
                  columns=['data1'],
                  index=date_rng)

Example input

                    data1
2019-01-01 00:00:00 1
2019-01-01 00:00:01 -2
2019-01-01 00:00:02 2
2019-01-01 00:00:03 3
2019-01-01 00:00:04 1
2019-01-01 00:00:05 2
2019-01-01 00:00:06 -1
2019-01-01 00:00:07 3
2019-01-01 00:00:08 4
2019-01-01 00:00:09 5
2019-01-01 00:00:10 7
2019-01-01 00:00:11 2
2019-01-01 00:00:12 4
2019-01-01 00:00:13 -1
2019-01-01 00:00:14 5
2019-01-01 00:00:15 3
2019-01-01 00:00:16 5
2019-01-01 00:00:17 -3
... ...

Expected output

                    data1 cumsum
2019-01-01 00:00:00 1     1
2019-01-01 00:00:01 -2    -1
2019-01-01 00:00:02 2     1
2019-01-01 00:00:03 3     4
2019-01-01 00:00:04 1     5 (reset cumsum at this point)
2019-01-01 00:00:05 2     2
2019-01-01 00:00:06 -1    1
2019-01-01 00:00:07 3     4
2019-01-01 00:00:08 4     8 (reset at this point)
2019-01-01 00:00:09 5     5 (reset at this point)
2019-01-01 00:00:10 7     7 (reset at this point)
2019-01-01 00:00:11 2     2
2019-01-01 00:00:12 4     6 (reset at this point)
2019-01-01 00:00:13 -1    -1
2019-01-01 00:00:14 5     4
2019-01-01 00:00:15 3     7 (reset at this point)
2019-01-01 00:00:16 1     1
2019-01-01 00:00:17 -3    -2
... ...

I want to calculate the cumulative sum of the data1 based on the condtiontion that it will reset if it larger than 5 and exceed 20% of the last reset. For example: for the first calculation, it will only reset if it's larger than 5, after that, it will reset based on both conditions.

I have checked out some other answers on stackoverflow but I didn't find similar problems. Please advice me on how to solve this problem.

Upvotes: 2

Views: 140

Answers (1)

ResidentSleeper
ResidentSleeper

Reputation: 2495

You can use a simple loop with if else and some logic to solve this problem.

ls = []  # store result
cumsum = 0
last_reset = 0
for _, row in df.iterrows():
    cumsum = cumsum + row.data1
    ls.append(cumsum)
    if cumsum >= 5 and cumsum > 0.2*last_reset:
        last_reset = cumsum
        cumsum = 0

df['cumsum'] = ls

df
                     data1  cumsum
2019-01-01 00:00:00      1       1
2019-01-01 00:00:01     -2      -1
2019-01-01 00:00:02      2       1
2019-01-01 00:00:03      3       4
2019-01-01 00:00:04      1       5
2019-01-01 00:00:05      2       2
2019-01-01 00:00:06     -1       1
2019-01-01 00:00:07      3       4
2019-01-01 00:00:08      4       8
2019-01-01 00:00:09      5       5
2019-01-01 00:00:10      7       7
2019-01-01 00:00:11      2       2
2019-01-01 00:00:12      4       6
2019-01-01 00:00:13     -1      -1
2019-01-01 00:00:14      5       4
2019-01-01 00:00:15      3       7
2019-01-01 00:00:16      1       1
2019-01-01 00:00:17     -3      -2

Upvotes: 1

Related Questions