bharling
bharling

Reputation: 2850

Sum timedeltas between specific pairs of rows in a pandas dataframe

have been wrestling with this for a while and can't figure it out.

I've got some logs of user actions when watching a live broadcast on our product, and I need to be able to get a picture of the total time a user was watching the broadcast, subtracting any time they had the stream paused.

my dataframe looks like this (after some filtering)

                dateHourMinute             event      user
2    2020-05-01 14:35:00+01:00              play  clqj9026
5811 2020-05-01 14:45:00+01:00             pause  clqj9026 # -- exclude this
5812 2020-05-01 15:00:00+01:00              play  clqj9026 # -- timedelta
5846 2020-05-01 15:01:00+01:00              play  clqj9026
6147 2020-05-01 15:07:00+01:00             pause  clqj9026
6148 2020-05-01 15:07:00+01:00              play  clqj9026
6354 2020-05-01 15:20:00+01:00             pause  clqj9026
6355 2020-05-01 15:20:00+01:00              play  clqj9026
6392 2020-05-01 15:21:00+01:00              play  clqj9026
6505 2020-05-01 15:23:00+01:00             pause  clqj9026
6506 2020-05-01 15:23:00+01:00  stopped_watching  clqj9026

I want to sum the timedeltas between each pair of 'play/pause' events but avoid including gaps between pause/play events, assuming that the user had the stream closed at this point.

The example shows contiguous events but we have to assume that there are instances where the stream was paused and the user was doing something else. Also, I need to disregard instances of the same event occurring twice in sequence. I know I can do df.dateHourMinute.diff().sum() but this doesn't take into account the periods when the stream would be paused.

Secondly, is there a way to do this without iterating over the unique values in the user column to get the total viewing time per-user?#

EDIT: Changed the table above to show a gap where the stream was paused. To clarify the total view time for the table above should come out at 33 minutes (Note the period between the first 'pause' at 14:45 and the second 'play' event at 15:00, I want to exclude that time period).

Upvotes: 0

Views: 50

Answers (1)

Hamid
Hamid

Reputation: 612

try this:

df['dateHourMinute'] = pd.to_datetime(df['dateHourMinute'])
df = df.sort_values('dateHourMinute')
df['time_diff'] = df['dateHourMinute'].shift(-1) - df['dateHourMinute']
df = df[df['event']=='play']
print(df['time_diff'].sum())

The first line converts dateHourMinute to date time. The second line sorts the data by time. The third line subtracts time between two consecutive rows. Now you have time between each play and pause. Now you can do anything you want with the data! the last line adds all the time_diffs. which for this data is 0 days 00:48:00

Let me know if it is helpful.

Upvotes: 2

Related Questions