ayaankazerouni

Reputation: 15

Group pandas events by start and end events

I have events that look like the following:

   |  time             |  type
-----------------------------------
1  |  2016-08-26 20:57 |  start
2  |  2016-08-26 20:28 |  event 1
3  |  2016-08-26 21:00 |  end
4  |  2016-08-26 21:27 |  start
5  |  2016-08-26 21:30 |  event 2
6  |  2016-08-26 21:32 |  event 3
7  |  2016-08-26 21:33 |  end

I would like to group the above data into 'sessions' bounded by start and end events. So rows 1 through 3 would be in one session, and rows 4 through 7 would be in another session. Note that start and end also count as events, so I do not want to eliminate them from the final result.

The only thing I've come up with is to create a new column in which I assign each row a "session id", and then simply groupby on that column. I could loop over the events and do it manually, but I was wondering if there's a vectorized way to do it.

This other answer does something similar. It groups events that occur within 6 hours of each other. In my case, I can't look at the previous row to decide if a new event should be in a new session or the current one, so I'm back to trying to avoid looping.

Any help is appreciated.
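For reference, a minimal reconstruction of the example data above (the exact DataFrame construction is an assumption, since the question only shows the printed table):

```python
import pandas as pd

# Rebuild the example data with the same 1-based index as shown
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2016-08-26 20:57', '2016-08-26 20:28', '2016-08-26 21:00',
        '2016-08-26 21:27', '2016-08-26 21:30', '2016-08-26 21:32',
        '2016-08-26 21:33']),
    'type': ['start', 'event 1', 'end',
             'start', 'event 2', 'event 3', 'end'],
}, index=range(1, 8))
```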

Upvotes: 1

Views: 580

Answers (1)

DYZ

Reputation: 57033

Create a new column (e.g., 'session') that holds the row's index label wherever 'type' is 'start', and NaN elsewhere:

df.loc[df.type=='start', 'session'] = df.index[df.type=='start']

Forward-fill the undefined rows of the new column with the most recent defined value:

df.session = df.session.ffill().astype(int)  # fillna(method='ffill') in older pandas
#               time     type  session
#1  2016-08-26 20:57    start      1
#2  2016-08-26 20:28  event 1      1
#3  2016-08-26 21:00      end      1
#4  2016-08-26 21:27    start      4
#5  2016-08-26 21:30  event 2      4
#6  2016-08-26 21:32  event 3      4
#7  2016-08-26 21:33      end      4

Now each row carries a session label, and you can simply groupby on the 'session' column.
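Putting the steps above together into a self-contained sketch (the DataFrame construction is an assumption based on the question's table; the session logic is the answer's):

```python
import pandas as pd

# Reconstruct the example data with the 1-based index from the question
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2016-08-26 20:57', '2016-08-26 20:28', '2016-08-26 21:00',
        '2016-08-26 21:27', '2016-08-26 21:30', '2016-08-26 21:32',
        '2016-08-26 21:33']),
    'type': ['start', 'event 1', 'end',
             'start', 'event 2', 'event 3', 'end'],
}, index=range(1, 8))

# Label 'start' rows with their index, forward-fill the rest
df.loc[df.type == 'start', 'session'] = df.index[df.type == 'start']
df.session = df.session.ffill().astype(int)

# Group on the new column, e.g. count rows per session
sizes = df.groupby('session').size()
# sessions 1 and 4, containing 3 and 4 rows respectively
```

Each group keeps its bounding start and end rows, as the question requires.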

Upvotes: 2

Related Questions