tandem
tandem

Reputation: 2238

grouping by timestamp differences in pandas

I've the following dataframe

   NUMBER           TIMESTAMP
0      67 2020-03-06 12:06:25
1      67 2019-09-20 16:21:45
2      67 2019-09-17 17:54:40
3      67 2019-08-21 19:59:30
4      67 2019-08-13 19:40:26
5      67 2019-06-19 19:45:12
6      67 2019-04-30 20:46:03
7      67 2019-04-29 20:46:03

I am trying to group the NUMBER column by timestamps which have less than 30 (can be any arbitrary number) day time difference I am expecting the output to be:

  NUMBER           TIMESTAMP
0      67_4   2020-03-06 12:06:25
1      67_3   2019-09-20 16:21:45
2      67_3   2019-09-17 17:54:40
3      67_3   2019-08-21 19:59:30
4      67_3   2019-08-13 19:40:26
5      67_2   2019-06-19 19:45:12
6      67_1   2019-04-30 20:46:03
7      67_1   2019-04-29 20:46:03

Taking a simple diff on the timestamp like this or even ceil on the series doesn't help

#df['diff'] = df.groupby('NUMBER')['TIMESTAMP'].diff(-1).dt.days
#mask = df['diff'] > 30
#
#df.loc[mask, 'NUMBER'] = df\
#                            .loc[mask, 'NUMBER']\
#                            .astype(str) \
#                            + '_' \
#                            + df[mask]\
#                            .groupby('NUMBER')\
#                            .cumcount().add(1).astype(str)

This is the result if we take a simple diff

  NUMBER           TIMESTAMP   diff
0   67_1 2020-03-06 12:06:25  167.0
1     67 2019-09-20 16:21:45    2.0
2     67 2019-09-17 17:54:40   26.0
3     67 2019-08-21 19:59:30    8.0
4   67_2 2019-08-13 19:40:26   54.0
5   67_3 2019-06-19 19:45:12   49.0
6     67 2019-04-30 20:46:03    1.0
7     67 2019-04-29 20:46:03    NaN

Any help will be appreciated.

Upvotes: 0

Views: 57

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150745

Let's try comparing differences with the thresh hold and cumsum:

thresh = 30
dt_thresh = pd.to_timedelta(thresh, unit='D')

df['NUMBER'] = df['NUMBER'].astype(str) + '_' +  \
              (df.TIMESTAMP.sort_values()
                 .groupby(df.NUMBER)        # added groupby here
                 .diff().gt(dt_thresh)
                 .groupby(df.NUMBER)        # and groupby here
                 .cumsum().add(1).astype(str)
              )

Output:

  NUMBER           TIMESTAMP
0   67_4 2020-03-06 12:06:25
1   67_3 2019-09-20 16:21:45
2   67_3 2019-09-17 17:54:40
3   67_3 2019-08-21 19:59:30
4   67_3 2019-08-13 19:40:26
5   67_2 2019-06-19 19:45:12
6   67_1 2019-04-30 20:46:03
7   67_1 2019-04-29 20:46:03

Upvotes: 1

Related Questions