BBSysDyn
BBSysDyn

Reputation: 4601

Taking Differences of Records When Status Changes - Pandas

I have customer records with id, timestamp and status.

ID, TS, STATUS
1 10 GOOD
1 20 GOOD
1 25 BAD
1 30 BAD
1 50 BAD
1 600 GOOD
2 40 GOOD
.. ... 

I am trying to calculate how much time is spent in consecutive BAD statuses (lets imagine order above is correct) per customer. So for customer id=1, 30-25,50-30,600-50 in total 575 seconds was spent in BAD status.

What is the method of doing this in Pandas? If I calculate .diff() on TS, that would give me differences, but how can I tie that 1) to the customer 2) certain status "blocks" for that customer?

Sample data:

df = pandas.DataFrame({'ID':[1,1,1,1,1,1,2],
                       'TS':[10,20,25,30,50,600,40],
                       'Status':['G','G','B','B','B','G','G']
                       },
                      columns=['ID','TS','Status'])

Thanks,

Upvotes: 1

Views: 474

Answers (2)

Garrett
Garrett

Reputation: 49886

Here's a solution to separately aggregate each contiguous block of bad status (part 2 of your question?).

In [5]: df = pandas.DataFrame({'ID':[1,1,1,1,1,1,1,1,2,2,2],
                               'TS':[10,20,25,30,50,600,650,670,40,50,60],
                               'Status':['G','G','B','B','B','G','B','B','G','B','B']
                               },
                               columns=['ID','TS','Status'])

In [6]: grp = df.groupby('ID')

In [7]: def status_change(df):
   ...:         return (df.Status.shift(1) != df.Status).astype(int)
   ...: 

In [8]: df['BlockId'] = grp.apply(lambda df: status_change(df).cumsum())

In [9]: df['Duration'] = grp.TS.diff().shift(-1)

In [10]: df
Out[10]: 
    ID   TS Status  BlockId  Duration
0    1   10      G        1        10
1    1   20      G        1         5
2    1   25      B        2         5
3    1   30      B        2        20
4    1   50      B        2       550
5    1  600      G        3        50
6    1  650      B        4        20
7    1  670      B        4       NaN
8    2   40      G        1        10
9    2   50      B        2        10
10   2   60      B        2       NaN

In [11]: df[df.Status == 'B'].groupby(['ID', 'BlockId']).Duration.sum()
Out[11]: 
ID  BlockId
1   2          575
    4           20
2   2           10
Name: Duration

Upvotes: 2

Zelazny7
Zelazny7

Reputation: 40638

In [1]: df = DataFrame({'ID':[1,1,1,1,1,2,2],'TS':[10,20,25,30,50,10,40],'Stat
us':['G','G','B','B','B','B','B']}, columns=['ID','TS','Status'])

In [2]: f = lambda x: x.diff().sum()

In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['TS'].transform(f)

In [4]: df
Out[4]:
   ID  TS Status  diff
0   1  10      G   NaN
1   1  20      G   NaN
2   1  25      B    25
3   1  30      B    25
4   1  50      B    25
5   2  10      B    30
6   2  40      B    30

Explanation: Subset the dataframe to only those records with the desired Status. Groupby the ID and apply the lambda function diff().sum() to each group. Use transform instead of apply because transform returns an indexed series which you can use to assign to a new column 'diff'.

EDIT: New response to account for expanded question scope.

In [1]: df
Out[1]:
   ID   TS Status
0   1   10      G
1   1   20      G
2   1   25      B
3   1   30      B
4   1   50      B
5   1  600      G
6   2   40      G

In [2]: df['shift'] = -df['TS'].diff(-1)

In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')
In [4]: df
Out[4]:
   ID   TS Status  shift  diff
0   1   10      G     10   NaN
1   1   20      G      5   NaN
2   1   25      B      5   575
3   1   30      B     20   575
4   1   50      B    550   575
5   1  600      G   -560   NaN
6   2   40      G    NaN   NaN

Upvotes: 2

Related Questions