Pandas Datetime: Aggregate based on time lag

Question

I have a data frame with 3 columns, containing id, timestamp and event type:

    id      timestamp   event_type
    ___________________________
0   1       2019-10-01      E1
1   1       2019-10-03      E3
2   2       2019-10-04      E3
3   2       2019-10-05      E4
4   2       2019-10-06      E1
5   1       2019-10-07      E3
6   1       2019-10-07      E4
7   1       2019-10-13      E3
8   2       2019-10-22      E5

I am looking for a way to aggregate it so that rows that belong to the same id and have a time lag of less than X days are grouped together. Let's say X=3 for the example's sake.

I.e. row 0 and row 1 should end up in one list, as their timestamps are less than 3 days apart.

So my desired output is the following:

    id2     event_hist
    _______________
0   1-1     [E1, E3]
1   2-1     [E3, E4, E1]
2   1-2     [E3, E4]
3   1-3     [E3]
4   2-2     [E5]

The id2 column is just the id from the first data frame, with a counter that is incremented for each new sequence of that id.

I could write a function to achieve the desired result, but is there a built-in method? What's the most pythonic way to get the desired output?

splash58 · Accepted Answer

If the timestamp column is not already of datetime type, convert it first:

import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])

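# Per id, count the rows inside the 3-day window that ends at each row.
t = df.groupby('id').apply(lambda g: g.rolling('3d', on='timestamp').count())
# A new sequence starts wherever that count fails to grow compared with the previous row.
new = df.groupby(t['id'].le(t.shift()['id']).cumsum()) \
        .agg(event_hist=('event_type', list), id2=('id', 'first'))
# Number the sequences within each id: 1-1, 1-2, ...
new['id2'] = new['id2'].astype(str) + \
                '-' + \
                new.groupby('id2').cumcount().add(1).astype(str)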

This results in:

      event_hist  id2
id                   
0       [E1, E3]  1-1
1   [E3, E4, E1]  2-1
2       [E3, E4]  1-2
3           [E3]  1-3
4           [E5]  2-2
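
For comparison, here is a minimal alternative sketch that states the rule from the question directly: a row opens a new sequence when it is the first row of its id, or when the gap to the previous row of the same id is 3 days or more. It is not the method used above. It assumes the rows are already sorted by timestamp within each id, and the names X, gap, new_seq, seq and out are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2019-10-01", "2019-10-03", "2019-10-04", "2019-10-05", "2019-10-06",
        "2019-10-07", "2019-10-07", "2019-10-13", "2019-10-22",
    ]),
    "event_type": ["E1", "E3", "E3", "E4", "E1", "E3", "E4", "E3", "E5"],
})

# Assumed threshold: gaps shorter than X stay in the same sequence.
X = pd.Timedelta(days=3)

# Gap to the previous row of the same id (NaT for the first row of each id).
gap = df.groupby("id")["timestamp"].diff()

# A row opens a new sequence if it is the first of its id or the gap is >= X.
new_seq = gap.isna() | (gap >= X)

# Running sequence number within each id: 1, 2, 3, ...
seq = new_seq.astype(int).groupby(df["id"]).cumsum()

# Build the id2 label and collect the event types of each sequence into a list.
out = (
    df.assign(id2=df["id"].astype(str) + "-" + seq.astype(str))
      .groupby("id2", sort=False)["event_type"]
      .agg(list)
      .rename("event_hist")
      .reset_index()
)
print(out)
```

Because the sequence number is computed per id, this version does not depend on the rows of one sequence being adjacent in the frame, only on the timestamps being in order within each id.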
