Reputation: 15
I have events that look like the following:
  |       time       | type
-----------------------------
1 | 2016-08-26 20:57 | start
2 | 2016-08-26 20:28 | event 1
3 | 2016-08-26 21:00 | end
4 | 2016-08-26 21:27 | start
5 | 2016-08-26 21:30 | event 2
6 | 2016-08-26 21:32 | event 3
7 | 2016-08-26 21:33 | end
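For reference, a minimal sketch that builds this sample frame (the pandas construction and the 1-based index are assumptions for illustration; only the column names and values come from the table above):

import pandas as pd

# Sample data matching the table above; the 1..7 index is assumed.
df = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "2016-08-26 20:57", "2016-08-26 20:28", "2016-08-26 21:00",
            "2016-08-26 21:27", "2016-08-26 21:30", "2016-08-26 21:32",
            "2016-08-26 21:33",
        ]),
        "type": ["start", "event 1", "end", "start", "event 2", "event 3", "end"],
    },
    index=range(1, 8),
)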
I would like to group the above data into 'sessions' bounded by 'start' and 'end' events. So rows 1 through 3 would be in one session, and rows 4 through 7 would be in another. Note that 'start' and 'end' also count as events, so I do not want to eliminate them from the final result.
The only thing I've come up with is to create a new column in which I assign each row a "session id", and then simply groupby on that column. I could loop over the events and do it manually, but I was wondering if there's a vectorized way to do it.
This other answer does something similar. It groups events that occur within 6 hours of each other. In my case, I can't look at the previous row to decide if a new event should be in a new session or the current one, so I'm back to trying to avoid looping.
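For context, that time-gap technique boils down to a one-liner like this sketch (using the sample frame above; the 6-hour threshold is the one from the linked answer):

# Start a new group whenever the gap to the previous event exceeds 6 hours.
# This depends on comparing each row to the previous one, which is exactly
# what does not work for start/end-delimited sessions.
gap_ids = (df["time"].diff() > pd.Timedelta(hours=6)).cumsum()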
Any help is appreciated.
Upvotes: 1
Views: 580
Reputation: 57033
Create a new column (e.g., 'session') that is set to the row's index value only where 'type' is 'start':
df.loc[df.type=='start', 'session'] = df.index[df.type=='start']
Forward fill the undefined rows of the new column with the most recent defined row values:
df.session = df.session.ffill().astype(int)
# time type session
#1 2016-08-26 20:57 start 1
#2 2016-08-26 20:28 event 1 1
#3 2016-08-26 21:00 end 1
#4 2016-08-26 21:27 start 4
#5 2016-08-26 21:30 event 2 4
#6 2016-08-26 21:32 event 3 4
#7 2016-08-26 21:33 end 4
Now every row carries the session id of the 'start' row that opened its session, so you can group on that column.
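To illustrate, grouping on the new column yields one group per start..end block (a sketch using the sample frame from the question):

# Each group spans one start..end session.
for session_id, group in df.groupby("session"):
    print(session_id, len(group))
# 1 3
# 4 4

If consecutive integer ids are preferred, an equivalent vectorized one-liner marks each 'start' row and takes a cumulative sum:

df["session"] = (df.type == "start").cumsum()
# Sessions are numbered 1, 2, ... in order of appearance.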
Upvotes: 2