clog14

Reputation: 1641

Pandas form groups by closeness of timestamps

I have a dataframe generated by the following code:

import pandas as pd

l_dates = ['2017-01-01 19:53:36',
           '2017-01-01 19:54:36',
           '2017-01-03 18:15:13',
           '2017-01-03 18:18:11',
           '2017-01-03 18:44:35',
           '2017-01-07 12:50:48']

l_ids = list(range(len(l_dates)))

l_values = [x*1000-1 for x in l_ids]

l_data = list(zip(l_dates, l_ids, l_values))

df1_ = pd.DataFrame(data = l_data, columns = ['timeStamp', 'usageid', 'values'])

It looks as follows:

             timeStamp  usageid  values
0  2017-01-01 19:53:36        0      -1
1  2017-01-01 19:54:36        1     999
2  2017-01-03 18:15:13        2    1999
3  2017-01-03 18:18:11        3    2999
4  2017-01-03 18:44:35        4    3999
5  2017-01-07 12:50:48        5    4999

I would like to group observations that are close together in time. For instance, an observation that falls within 15 minutes of the previous one should be placed in the same group.

I know that I can flag consecutive observations that are less than 15 minutes apart as follows:

pd.to_datetime(df1_['timeStamp']).diff() < pd.Timedelta(minutes=15)

However, I have not managed to group them so that I get a dataframe like the following:

             timeStamp  usageid  values   session
0  2017-01-01 19:53:36        0      -1  Session1
1  2017-01-01 19:54:36        1     999  Session1
2  2017-01-03 18:15:13        2    1999  Session2
3  2017-01-03 18:18:11        3    2999  Session2
4  2017-01-03 18:44:35        4    3999  Session3
5  2017-01-07 12:50:48        5    4999  Session4

Many thanks in advance and please let me know in case you need further information.

Upvotes: 6

Views: 836

Answers (1)

BENY

Reputation: 323276

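You need diff and cumsum. Assuming timeStamp has already been converted with pd.to_datetime and numpy is imported as np, flag every row whose gap to the previous row exceeds 15 minutes, then take the cumulative sum of those flags:

'Session'+(df.timeStamp.diff().fillna(pd.Timedelta(0))/np.timedelta64(15, 'm')).gt(1).cumsum().add(1).astype(str)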
Out[959]: 
0    Session1
1    Session1
2    Session2
3    Session2
4    Session3
5    Session4
Name: timeStamp, dtype: object
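
The one-liner above can be unpacked into separate steps. The following is a minimal sketch, not the exact code from the answer: it compares the gaps directly against pd.Timedelta(minutes=15) instead of dividing by np.timedelta64(15, 'm'), which should be equivalent for this data. The variable names (ts, gaps, new_session, session_no, labels) are made up for illustration.

```python
import pandas as pd

# Same timestamps as in the question, parsed to datetime64 so that diff() works.
ts = pd.Series(pd.to_datetime([
    '2017-01-01 19:53:36',
    '2017-01-01 19:54:36',
    '2017-01-03 18:15:13',
    '2017-01-03 18:18:11',
    '2017-01-03 18:44:35',
    '2017-01-07 12:50:48',
]), name='timeStamp')

# Step 1: gap between each row and the previous one (NaT for the first row).
gaps = ts.diff()

# Step 2: flag rows whose gap exceeds 15 minutes; the NaT in row 0 compares as False.
new_session = gaps > pd.Timedelta(minutes=15)

# Step 3: running count of the flags, shifted by one so numbering starts at 1.
session_no = new_session.cumsum() + 1

# Step 4: turn the numbers into string labels.
labels = 'Session' + session_no.astype(str)
print(labels.tolist())
```

Note that only consecutive rows are compared, so a chain of rows that are each less than 15 minutes apart ends up in one session even if the chain as a whole spans more than 15 minutes. The timestamps are also assumed to be sorted in ascending order.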

Then assign the result back to the dataframe:

df['Session']='Session'+(df.timeStamp.diff().fillna(pd.Timedelta(0))/np.timedelta64(15, 'm')).gt(1).cumsum().add(1).astype(str)
df
Out[961]: 
            timeStamp  usageid  values   Session
0 2017-01-01 19:53:36        0      -1  Session1
1 2017-01-01 19:54:36        1     999  Session1
2 2017-01-03 18:15:13        2    1999  Session2
3 2017-01-03 18:18:11        3    2999  Session2
4 2017-01-03 18:44:35        4    3999  Session3
5 2017-01-07 12:50:48        5    4999  Session4
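
Once the Session column exists, it can be used as an ordinary groupby key. The sketch below rebuilds the example dataframe and summarises each session; the summary column names (start, end, n_rows, total) are hypothetical and chosen only for illustration, and the labelling line is a simplified variant of the expression above.

```python
import pandas as pd

# Rebuild the example data with timeStamp parsed as datetime64.
df = pd.DataFrame({
    'timeStamp': pd.to_datetime([
        '2017-01-01 19:53:36',
        '2017-01-01 19:54:36',
        '2017-01-03 18:15:13',
        '2017-01-03 18:18:11',
        '2017-01-03 18:44:35',
        '2017-01-07 12:50:48',
    ]),
    'usageid': list(range(6)),
    'values': [x * 1000 - 1 for x in range(6)],
})

# Label sessions: a new session starts when the gap to the previous row exceeds 15 minutes.
new_session = df['timeStamp'].diff() > pd.Timedelta(minutes=15)
df['Session'] = 'Session' + new_session.cumsum().add(1).astype(str)

# Summarise each session with named aggregation (pandas 0.25 or later).
summary = df.groupby('Session').agg(
    start=('timeStamp', 'min'),
    end=('timeStamp', 'max'),
    n_rows=('usageid', 'count'),
    total=('values', 'sum'),
)
print(summary)
```

If the string labels are not needed, the integer result of cumsum() could be used as the group key directly.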

Upvotes: 4
