IronKirby

Reputation: 708

Pandas GroupBy with CumulativeSum per Contiguous Groups

I'm looking to understand the number of times we are in an 'Abnormal State' before we have an 'Event'. My objective is to modify my dataframe to get the following output, where every time we reach an 'Event', the 'Abnormal State Grouping' resets and counts from 0.

We can go through a number of 'Abnormal States' before we reach an 'Event', which is deemed a failure. (i.e. The lightbulb is switched on and off for several periods before it finally shorts out resulting in an event).

I've written the following code to increment my AbnormalStateGroupings into the relevant groupings for my analysis, which has worked fine. However, we want to 'reset' the count of 'AbnormalStates' after each event (i.e. each lightbulb failure):

dataframe['AbnormalStateGrouping'] = (dataframe['AbnormalState']!=dataframe['AbnormalState'].shift()).cumsum()
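To illustrate the idiom on its own, here is a minimal sketch with made-up data: comparing each value to its shifted neighbour marks the start of every run of equal values, and the cumulative sum of those marks gives each contiguous run its own id:

```python
import pandas as pd

# Hypothetical AbnormalState column: runs of equal values.
s = pd.Series([0, 0, 1, 1, 0, 1])

# True at the first row of each new run; cumsum turns that into a run id.
group_id = (s != s.shift()).cumsum()
print(group_id.tolist())  # [1, 1, 2, 2, 3, 4]
```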

I have created an additional column which lets me know which 'event' we are at via:

dataframe['Event_Or_Not'].cumsum()  # boolean representation of the Event column; cumsum() yields the relevant groupings (1st Event, 2nd Event, 3rd Event, etc.)
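As a quick sketch of that step (with a made-up boolean column), the cumulative sum of a True/False series increments at each True, numbering the events:

```python
import pandas as pd

# Hypothetical Event_Or_Not column: True marks a lightbulb failure.
event_or_not = pd.Series([False, False, True, False, True, False])
print(event_or_not.cumsum().tolist())  # [0, 0, 1, 1, 2, 2]
```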

I've come close previously using the following:

eventOrNot = dataframe['Event'].eq(0)
eventMask = (eventOrNot.ne(eventOrNot.shift()) & eventOrNot).cumsum()
dataframe['AbnormalStatePerEvent'] = dataframe.groupby(['Event', eventMask]).cumcount().add(1)

However, this hasn't given me the desired output I'm after (as per below). I think I'm close, though. Could anyone please advise what I could try next so that, for each lightbulb failure, the abnormal state count resets and starts counting the number of abnormal states we go through before the next lightbulb failure?

State I want to get to with AbnormalStateGrouping

You would note that when an 'Event' is detected, the Abnormal State count resets to 1 and then starts counting again.

End State

Current State of Dataframe


Please find an attached data source below: https://filebin.net/ctjwk7p3gulmbgkn

Upvotes: 1

Views: 218

Answers (1)

Valdi_Bo

Reputation: 30971

I assume that your source DataFrame has only Date/Time (either string or datetime), Event (string) and AbnormalState (int) columns.

To compute your grouping column, run:

dataframe['AbnormalStateGrouping'] = (
    dataframe.groupby(dataframe['Event'][::-1].notnull().cumsum())['AbnormalState']
    .apply(lambda grp: (grp != grp.shift()).cumsum())
)
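Here is a self-contained sketch of the same approach on a small made-up frame (using transform, an index-safe equivalent of the apply above, so the result aligns with the original rows regardless of group order):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Event in the middle of the series.
df = pd.DataFrame({
    'Event': [np.nan, np.nan, np.nan, 'Event', np.nan, np.nan],
    'AbnormalState': [0, 1, 1, 0, 1, 0],
})

# Reversed cumsum: the Event row closes the group containing the rows before it.
key = df['Event'][::-1].notnull().cumsum()

# Within each group, restart the contiguous-run numbering from 1.
df['AbnormalStateGrouping'] = (
    df.groupby(key)['AbnormalState']
      .transform(lambda grp: (grp != grp.shift()).cumsum())
)
print(df['AbnormalStateGrouping'].tolist())  # [1, 2, 2, 3, 1, 2]
```

Note that the count resets to 1 on the row after the Event, not on the Event row itself.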

For your initial source data (included as a picture), the result is:

           Date/Time  Event  AbnormalState  AbnormalStateGrouping
0   2018-01-01 01:00    NaN              0                      1
1   2018-01-01 02:00    NaN              0                      1
2   2018-01-01 03:00    NaN              1                      2
3   2018-01-01 04:00    NaN              1                      2
4   2018-01-01 05:00    NaN              0                      3
5   2018-01-01 06:00    NaN              0                      3
6   2018-01-01 07:00    NaN              0                      3
7   2018-01-01 08:00    NaN              1                      4
8   2018-01-01 09:00    NaN              1                      4
9   2018-01-01 10:00    NaN              0                      5
10  2018-01-01 11:00    NaN              0                      5
11  2018-01-01 12:00    NaN              0                      5
12  2018-01-01 13:00    NaN              1                      6
13  2018-01-01 14:00    NaN              1                      6
14  2018-01-01 15:00    NaN              0                      7
15  2018-01-01 16:00  Event              0                      7
16  2018-01-01 17:00    NaN              1                      1
17  2018-01-01 18:00    NaN              1                      1
18  2018-01-01 19:00    NaN              0                      2
19  2018-01-01 20:00    NaN              0                      2

Note the way of grouping:

dataframe['Event'][::-1].notnull().cumsum()

Due to [::-1], the cumsum is computed from the last row to the first. Thus:

  • rows with hours 01:00 thru 16:00 are in group 1,
  • remaining rows (hour 17:00 thru 20:00) are in group 0.

Then, separately for each group, a lambda function is applied to AbnormalState, so each cumulative count restarts from 1 within each group (i.e. after each Event).

Edit following the comment as of 22:18:12Z

The reason I compute the cumsum for the grouping in reversed order is that when you run it in normal order:

dataframe['Event'].notnull().cumsum()

then:

  • rows with index 0 thru 14 (before the row with Event) have this sum == 0,
  • row with index 15 and following rows have this sum == 1.

Try yourself both versions, without and with [::-1].
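The two versions can be compared side by side on a made-up Event column (sort_index is added here only to print the reversed key in row order):

```python
import numpy as np
import pandas as pd

# Hypothetical Event column with one Event in the middle.
event = pd.Series([np.nan, np.nan, 'Event', np.nan, np.nan])

forward = event.notnull().cumsum()
reversed_key = event[::-1].notnull().cumsum().sort_index()

print(forward.tolist())       # [0, 0, 1, 1, 1] -> Event row grouped with the rows after it
print(reversed_key.tolist())  # [1, 1, 1, 0, 0] -> Event row grouped with the rows before it
```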

The result in normal order (without [::-1]) is that:

  • the Event row is in the same group as the following rows,
  • so the reset occurs on the Event row itself.

To check the whole result, run my code without [::-1] and you will see that the ending part of the result contains:

             Date/Time  Event  AbnormalState  AbnormalStateGrouping
14 2018-01-01 15:00:00    NaN              0                      7
15 2018-01-01 16:00:00  Event              0                      1
16 2018-01-01 17:00:00    NaN              1                      2
17 2018-01-01 18:00:00    NaN              1                      2
18 2018-01-01 19:00:00    NaN              0                      3
19 2018-01-01 20:00:00    NaN              0                      3

so the Event row has AbnormalStateGrouping == 1.

But you want this row to continue the sequence of the previous grouping states (in this case 7), with the reset occurring from the next row on.

So the Event row should be in the same group as the preceding rows, which is what my code produces.

Upvotes: 2
