Cumulative Sum With Reset Condition

Question

Hi so I have a dataframe and I would like to find the index whenever one of the column's cumulative sum is equal to a threshold. It will then reset and start the cumsum again.

For example:

    d = np.random.randn(10, 1) * 2
    df = pd.DataFrame(d.astype(int), columns=['data'])
    pd.concat([df,df.cumsum()],axis=1)

Outout:

Out[34]:

   data  data1
0     1     1
1     2     3
2     3     6
3     2     8
4     0     8
5     1     9
6     0     9
7    -1     8
8     1     9
9     2    11

So in the above sample data, data1 is the cumsum of column 1. If I set thres=5 this means that whenever the running sum of column 1 is greater than or equal to 5, I save the index. After that happens, the running sum resets and start again until the next running total sum is greater than or equal to 5 is reached.

Right now I am doing a loop and keeping to track the running sum an manually resetting. I was wondering if there is a fast vectorized way in pandas to do it as my dataframe is millions of rows long.

Thanks

Alain T. · Accepted Answer

I'm not familiar with pandas but my understanding is that it is based on numpy. Using numpy you can define custom functions that can be used with accumulate.

Here is one that I think is close to what you're looking for:

import numpy as np
def capsum(array,cap):
   capAdd = np.frompyfunc(lambda a,b:a+b if a < cap else b,2,1)
   return capAdd.accumulate(values, dtype=np.object)

values = np.random.rand(1000000) * 3 // 1

result = capsum(values,5)  # --> produces the result in 0.17 sec.

I believe (or I hope) you can use numpy functions on dataframes.

Cumulative Sum With Reset Condition

Answers (1)

Related Questions