Reputation: 4601
I have customer records with id, timestamp and status.
ID, TS, STATUS
1 10 GOOD
1 20 GOOD
1 25 BAD
1 30 BAD
1 50 BAD
1 600 GOOD
2 40 GOOD
.. ...
I am trying to calculate how much time is spent in consecutive BAD statuses (lets imagine order above is correct) per customer. So for customer id=1, 30-25,50-30,600-50 in total 575 seconds was spent in BAD status.
What is the method of doing this in Pandas? If I calculate .diff() on TS, that would give me differences, but how can I tie that 1) to the customer 2) certain status "blocks" for that customer?
Sample data:
df = pandas.DataFrame({'ID':[1,1,1,1,1,1,2],
'TS':[10,20,25,30,50,600,40],
'Status':['G','G','B','B','B','G','G']
},
columns=['ID','TS','Status'])
Thanks,
Upvotes: 1
Views: 474
Reputation: 49886
Here's a solution to separately aggregate each contiguous block of bad status (part 2 of your question?).
In [5]: df = pandas.DataFrame({'ID':[1,1,1,1,1,1,1,1,2,2,2],
'TS':[10,20,25,30,50,600,650,670,40,50,60],
'Status':['G','G','B','B','B','G','B','B','G','B','B']
},
columns=['ID','TS','Status'])
In [6]: grp = df.groupby('ID')
In [7]: def status_change(df):
...: return (df.Status.shift(1) != df.Status).astype(int)
...:
In [8]: df['BlockId'] = grp.apply(lambda df: status_change(df).cumsum())
In [9]: df['Duration'] = grp.TS.diff().shift(-1)
In [10]: df
Out[10]:
ID TS Status BlockId Duration
0 1 10 G 1 10
1 1 20 G 1 5
2 1 25 B 2 5
3 1 30 B 2 20
4 1 50 B 2 550
5 1 600 G 3 50
6 1 650 B 4 20
7 1 670 B 4 NaN
8 2 40 G 1 10
9 2 50 B 2 10
10 2 60 B 2 NaN
In [11]: df[df.Status == 'B'].groupby(['ID', 'BlockId']).Duration.sum()
Out[11]:
ID BlockId
1 2 575
4 20
2 2 10
Name: Duration
Upvotes: 2
Reputation: 40638
In [1]: df = DataFrame({'ID':[1,1,1,1,1,2,2],'TS':[10,20,25,30,50,10,40],'Stat
us':['G','G','B','B','B','B','B']}, columns=['ID','TS','Status'])
In [2]: f = lambda x: x.diff().sum()
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['TS'].transform(f)
In [4]: df
Out[4]:
ID TS Status diff
0 1 10 G NaN
1 1 20 G NaN
2 1 25 B 25
3 1 30 B 25
4 1 50 B 25
5 2 10 B 30
6 2 40 B 30
Explanation:
Subset the dataframe
to only those records with the desired Status. Groupby
the ID and apply the lambda function diff().sum()
to each group. Use transform
instead of apply
because transform
returns an indexed series which you can use to assign to a new column 'diff'.
EDIT: New response to account for expanded question scope.
In [1]: df
Out[1]:
ID TS Status
0 1 10 G
1 1 20 G
2 1 25 B
3 1 30 B
4 1 50 B
5 1 600 G
6 2 40 G
In [2]: df['shift'] = -df['TS'].diff(-1)
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')
In [4]: df
Out[4]:
ID TS Status shift diff
0 1 10 G 10 NaN
1 1 20 G 5 NaN
2 1 25 B 5 575
3 1 30 B 20 575
4 1 50 B 550 575
5 1 600 G -560 NaN
6 2 40 G NaN NaN
Upvotes: 2