Reputation: 260
I am trying to compute the time elapsed between two events (how long each ID was online within each hour) using pandas Grouper, but have been unable to do so. Below are the input and the expected output.
Input
ID Status Datetime
A Online 24/09/2017 7:00:00 AM
A Offline 24/09/2017 7:30:00 AM
A Offline 24/09/2017 8:30:00 AM
A Online 24/09/2017 9:30:00 AM
A Offline 24/09/2017 10:00:00 AM
B Offline 24/09/2017 6:00:00 AM
B Online 24/09/2017 7:30:00 AM
B Online 24/09/2017 9:10:00 AM
B Offline 24/09/2017 9:30:00 AM
B Online 24/09/2017 9:40:00 AM
B Offline 24/09/2017 10:00:00 AM
Output
ID Hour_start Hour_end Online_time
A 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
A 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
A 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 1800
B 24/09/2017 6:00:00 AM 24/09/2017 7:00:00 AM 0
B 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
B 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 3600
B 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 3000
Using Pandas Grouper
df_output = df.groupby(['ID',pd.Grouper(key='Datetime', freq='H'),'status'])['event_time'].diff().dt.seconds.fillna(0)
But this doesn't take into account the Online/Offline values in the Status column.
Please help me out. TIA
Upvotes: 0
Views: 358
Reputation: 30971
I assume that the Datetime column in your source DataFrame is of datetime64 type.
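If the column was read as text instead, it can be converted first. A minimal sketch, assuming the day-first, 12-hour format shown in the question; the frame below just retypes the question's sample rows, and the format string is my reading of that sample:

```python
import pandas as pd

# Sample rows retyped from the question (Datetime still as text)
df = pd.DataFrame({
    "ID": ["A"] * 5 + ["B"] * 6,
    "Status": ["Online", "Offline", "Offline", "Online", "Offline",
               "Offline", "Online", "Online", "Offline", "Online", "Offline"],
    "Datetime": ["24/09/2017 7:00:00 AM", "24/09/2017 7:30:00 AM",
                 "24/09/2017 8:30:00 AM", "24/09/2017 9:30:00 AM",
                 "24/09/2017 10:00:00 AM", "24/09/2017 6:00:00 AM",
                 "24/09/2017 7:30:00 AM", "24/09/2017 9:10:00 AM",
                 "24/09/2017 9:30:00 AM", "24/09/2017 9:40:00 AM",
                 "24/09/2017 10:00:00 AM"],
})

# Day first, 12-hour clock with an AM/PM marker (assumed from the sample)
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%d/%m/%Y %I:%M:%S %p")
```

Passing an explicit format avoids any day/month ambiguity when parsing.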
My solution is based on 2-level grouping, first by ID and then (after some intermediate operations) by hour.
Define 2 functions:
onTimeById, to compute Online time for each ID (the "external" grouping level):
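import pandas as pd

def onTimeById(grp):
    # Keep only the rows where Status changes (drop consecutive repeats)
    wrk = grp[grp.Status != grp.Status.shift()]
    wrk = wrk.set_index('Datetime').Status
    # Add a row at every hour boundary and carry the last known Status forward
    # (date_range starts at the first event, so that event is assumed to fall on a full hour)
    wrk = wrk.reindex(wrk.index.union(pd.date_range(wrk.index.min(),
        wrk.index.max(), freq='H'))).ffill()
    # Online seconds within each hourly bin
    res = wrk.groupby(pd.Grouper(freq='H')).apply(onTimeByHour)
    # Drop the trailing bin and name the columns
    rv = res.iloc[:-1].reset_index().rename(columns={'index': 'Hour_start',
        'Status': 'Online_time'})
    rv.insert(1, 'Hour_end', res.index[1:])
    return rv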
onTimeByHour, to compute Online time for each hour (the "internal" grouping level):
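def onTimeByHour(grp2):
    if grp2.size > 1:
        # Gaps between consecutive rows within this hour
        dd = grp2.index.to_series().diff()
        # A gap that ends at an Offline row was spent Online
        rv = dd[grp2 == 'Offline'].sum().seconds
        # If the hour ends Online, add the time from the last row to the end of the hour
        if grp2.iloc[-1] == 'Online':
            rv += 3600 - dd.sum().seconds
        return rv
    # Single row: the same Status holds for the whole hour
    return 0 if grp2.iloc[0] == 'Offline' else 3600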
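To see the arithmetic of onTimeByHour on one concrete bin, here is a small worked example for ID B's 09:00 hour. The Series is built by hand for illustration (the 09:00 row stands for the forward-filled hour-boundary row), and the variable names are mine, not part of the solution:

```python
import pandas as pd

# Rows that fall into B's 09:00-10:00 bin after de-duplication and forward fill
idx = pd.to_datetime(["2017-09-24 09:00", "2017-09-24 09:30", "2017-09-24 09:40"])
grp2 = pd.Series(["Online", "Offline", "Online"], index=idx)

# Gap before each row: NaT, 30 min, 10 min
dd = grp2.index.to_series().diff()

# Gaps ending at an Offline row were spent Online: 30 min
online_before_offline = dd[grp2 == "Offline"].sum().seconds

# The hour ends Online, so the remainder after the last row also counts:
# 3600 - (30 min + 10 min) = 1200 s
tail = 3600 - dd.sum().seconds

total = online_before_offline + tail
print(online_before_offline, tail, total)  # 1800 1200 3000
```

The total of 3000 seconds agrees with the expected output for B between 09:00 and 10:00.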
Then run:
res = df.groupby('ID').apply(onTimeById).reset_index(level=0).reset_index(drop=True)
The result, for your source data, is:
ID Hour_start Hour_end Online_time
0 A 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
1 A 2017-09-24 08:00:00 2017-09-24 09:00:00 0
2 A 2017-09-24 09:00:00 2017-09-24 10:00:00 1800
3 B 2017-09-24 06:00:00 2017-09-24 07:00:00 0
4 B 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
5 B 2017-09-24 08:00:00 2017-09-24 09:00:00 3600
6 B 2017-09-24 09:00:00 2017-09-24 10:00:00 3000
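As an independent cross-check of these numbers, the same totals can be reproduced with a different and deliberately simple technique: treat every Online event as an interval lasting until the next event, and clip each interval to each hourly bin. This is only a sketch, not the grouping approach above; the function name online_seconds_per_hour is hypothetical, and it assumes that the status after the last event of an ID is unknown and therefore not counted:

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)

def online_seconds_per_hour(events):
    """Hypothetical helper: Online seconds per ID and hourly bin."""
    rows = []
    for id_, grp in events.sort_values("Datetime").groupby("ID"):
        times = grp["Datetime"].tolist()
        states = grp["Status"].tolist()
        # Each Online event opens an interval that lasts until the next event
        online = [(t0, t1) for t0, t1, s in zip(times, times[1:], states)
                  if s == "Online"]
        lo = times[0].floor("60min")      # first hour boundary
        last = times[-1].ceil("60min")    # final hour boundary
        while lo < last:
            hi = lo + HOUR
            # Overlap of every Online interval with the bin [lo, hi)
            secs = sum(max((min(b, hi) - max(a, lo)).total_seconds(), 0.0)
                       for a, b in online)
            rows.append((id_, lo, hi, int(secs)))
            lo = hi
    return pd.DataFrame(rows, columns=["ID", "Hour_start", "Hour_end", "Online_time"])

# Sample rows retyped from the question (all on 2017-09-24)
raw = [("A", "Online", "07:00"), ("A", "Offline", "07:30"), ("A", "Offline", "08:30"),
       ("A", "Online", "09:30"), ("A", "Offline", "10:00"),
       ("B", "Offline", "06:00"), ("B", "Online", "07:30"), ("B", "Online", "09:10"),
       ("B", "Offline", "09:30"), ("B", "Online", "09:40"), ("B", "Offline", "10:00")]
df = pd.DataFrame(raw, columns=["ID", "Status", "Datetime"])
df["Datetime"] = pd.to_datetime("2017-09-24 " + df["Datetime"])

out = online_seconds_per_hour(df)
print(out["Online_time"].tolist())  # [1800, 0, 1800, 0, 1800, 3600, 3000]
```

The nested loop is quadratic in the worst case, so it is meant for checking results on small samples rather than for large data.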
To understand in detail how this solution works, save group "A" in a variable, e.g. by running:
gr = df.groupby('ID')
grp = gr.get_group('A')
Then execute each instruction from onTimeById and see the results.
Apply the same approach to trace how onTimeByHour works.
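For example, the first steps of onTimeById can be traced on group "A" like this. It is a self-contained sketch: the rows for A are retyped from the question, the names changes and hours are mine, and freq="60min" is used as an equivalent spelling of 'H' because recent pandas releases deprecate the uppercase alias:

```python
import pandas as pd

# Group "A" from the question, with Datetime already parsed
grp = pd.DataFrame({
    "ID": ["A"] * 5,
    "Status": ["Online", "Offline", "Offline", "Online", "Offline"],
    "Datetime": pd.to_datetime([
        "2017-09-24 07:00:00", "2017-09-24 07:30:00", "2017-09-24 08:30:00",
        "2017-09-24 09:30:00", "2017-09-24 10:00:00",
    ]),
})

# Step 1: keep only status changes (the repeated Offline at 08:30 is dropped)
changes = grp[grp.Status != grp.Status.shift()]
print(changes)

# Step 2: add the hour boundaries and carry the last known Status forward
wrk = changes.set_index("Datetime").Status
hours = pd.date_range(wrk.index.min(), wrk.index.max(), freq="60min")
wrk = wrk.reindex(wrk.index.union(hours)).ffill()
print(wrk)
# Index: 07:00, 07:30, 08:00, 09:00, 09:30, 10:00
# Values: Online, Offline, Offline, Offline, Online, Offline
```

After step 2 every hourly bin starts with a row at its boundary, which is what onTimeByHour relies on.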
Upvotes: 1