MatBailie
MatBailie

Reputation: 86775

Pandas - Cumulative Sum, with resets, using GroupBy

I have a dataframe of several thousand timeseries.


For each timeseries, there is a unique timestamp, so I can enforce the order.


Each timeseries has a state column


I need to 'discard' all rows before there have been a certain number of consecutive 1's


So, here's some sample data...

test = pd.DataFrame({
    'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
    'idx':   [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
    'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
})

The results that I want are...

desired_result = pd.DataFrame({
    'group': [        1,1,1,     2,2,2,2,2],
    'idx':   [        4,5,6,     2,3,4,5,6],
    'value': [        1,1,1,     1,1,0,1,0],
})

What I think I need to calculate is...

test = pd.DataFrame({
    'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
    'idx':   [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
    'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
   #'consec':[0,1,0,1,2,3,4, 0,1,2,3,0,1,0], -- the cumulative sum of value, but resetting whenever a 0 is encountered
   #'max_c': [0,1,1,1,2,3,4, 0,1,2,3,3,3,3], -- the cumulative max of consec
   #                  ^ ^ ^      ^ ^ ^ ^ ^   -- rows I want to keep, as max_c >= 2
})

Then I can just take the rows where test[ test['max_c'] >= 2 ]


But, how do I calculate consec?




EDIT: My best attempt, but it feels ridiculously long winded...

test['cumsum'] = test.groupby(['group'])['value'].cumsum()

test['reset'] = test['cumsum'][ test.groupby(['group'])['value'].diff() == -1 ]
test['reset'] = test['reset'].fillna(0)

test['reset_cummax'] = test.groupby(['group'])['reset'].cummax()

test['consec'] = test['cumsum'] - test['reset_cummax']

test['c_max'] = test.groupby(['group'])['consec'].cummax()

Upvotes: 0

Views: 594

Answers (1)

Ben.T
Ben.T

Reputation: 29635

IIUC, you can do the cumsum after a groupby on the column group and every time the column value is eq to 0, you create a new group with cumsum as well.

test['consec'] = test.groupby(['group', test['value'].eq(0).cumsum()])['value'].cumsum()
test['max_c'] = test.groupby(['group'])['consec'].cummax()
print(test)
    group  idx  value  consec  max_c
0       1    0      0       0      0
1       1    1      1       1      1
2       1    2      0       0      1
3       1    3      1       1      1
4       1    4      1       2      2
5       1    5      1       3      3
6       1    6      1       4      4
7       2    0      0       0      0
8       2    1      1       1      1
9       2    2      1       2      2
10      2    3      1       3      3
11      2    4      0       0      3
12      2    5      1       1      3
13      2    6      0       0      3

Upvotes: 2

Related Questions