Reputation: 163
I'm having trouble coming up with a way to perform a cumulative sum on a column and creating a flag once it hits a certain value.
So given a dataframe:
df = pd.DataFrame([[5,1],[6,1],[30,1],[170,0],[5,1],[10,1]],columns = ['a','b'])
a b
0 5 1
1 6 1
2 30 1
3 170 0
4 5 1
5 10 1
For column A, I want to perform the cumulative sum and set the "Flag" column value to 1 if that max value is reached. Upon reaching that max value, it will reset to 0. In this case, the max value is 40. Any cumulative sum over 40 will trigger the reset
Desired Output
a b Flag
0 5 1 0
1 11 1 0
2 41 1 1
3 170 0 1
4 5 1 0
5 15 1 0
Any help would be appreciated!
Upvotes: 4
Views: 1880
Reputation: 2111
df = pd.DataFrame([[5,1],[6,1],[30,1],[170,0],[5,1],[10,1], [25,1],[42,1],[41,5],[5,1]],columns = ['a','b'])
def cumsum_reset(df, sum_col="a"):
flag_collector = []
sum_collector = []
c_reset_value = df.index.min()
for i in df.index:
sum_current_range = sum(df.loc[c_reset_value:i,sum_col])
if sum_current_range >= 40:
flag_collector.append(True)
c_reset_value = i + 1
else:
flag_collector.append(False)
sum_collector.append(sum_current_range)
df["Flag"] = flag_collector
df["sum_cum"] = sum_collector
return df
a b Flag sum_cum
0 5 1 False 5
1 6 1 False 11
2 30 1 True 41
3 170 0 True 170
4 5 1 False 5
5 10 1 False 15
6 25 1 True 40
7 42 1 True 42
8 41 5 True 41
9 5 1 False 5
Dumbest way to do it. Added some extra values for extra checks.
Upvotes: 0
Reputation: 1483
There is a very straightforward, vectorized answer to this.
You are asking for each row where the integer division of the cumsum by 40 of up to its own value of a
is not the same as that same division result of the preceding row. In other words: If, from one row to the next, the integer division of cusum by 40 is different, then set the flag to true.
integer_division = df['a'].cumsum()//40
df['flag'] = (integer_division != integer_division.shift(1).fillna(0)).astype(int)
We proceed in two steps:
integer_division
I disagree with the other comments and answers: "Ordinary cumsum" is just perfect - the restart happens wherever the integer division result changes - that's where you want the restart. Your 'threshold' is "any change in result of the //40
operation". Indeed the outcome is what we were looking for:
print(df)
a b flag
0 5 1 0
1 6 1 0
2 30 1 1
3 170 0 1
4 5 1 0
5 10 1 0
A final comment: From the question it seems to me that there is a bit of interpretation room whether - when restarting - whether to reset the cumsum to the "cumsum value" or the "df maximum" that was reached.. The answer here restarts at the cumsum value. Regardless, also if restart at the "df value" is desired, it can be vectorized by buffering the non-integer division part (np.mod
), and subtracting it from the actual cumsum value. Also that adjustment does not require a "running memory".
Upvotes: 3
Reputation: 30971
"Ordinary" cumsum() is here useless, as this function "doesn't know" where to restart summation.
You can do it with the following custom function:
def myCumSum(x, thr):
if myCumSum.prev >= thr:
myCumSum.prev = 0
myCumSum.prev += x
return myCumSum.prev
This function is "with memory" (from the previous call) - prev, so there is a way to "know" where to restart.
To speed up the execution, define a vectorized version of this function:
myCumSumV = np.vectorize(myCumSum, otypes=[np.int], excluded=['thr'])
Then execute:
threshold = 40
myCumSum.prev = 0 # Set the "previous" value
# Replace "a" column with your cumulative sum
df.a = myCumSumV(df.a.values, threshold)
df['flag'] = df.a.ge(threshold).astype(int) # Compute "flag" column
The result is:
a b flag
0 5 1 0
1 11 1 0
2 41 1 1
3 170 0 1
4 5 1 0
5 15 1 0
Upvotes: 4