Reputation: 2238
I've the following dataframe
NUMBER TIMESTAMP
0 67 2020-03-06 12:06:25
1 67 2019-09-20 16:21:45
2 67 2019-09-17 17:54:40
3 67 2019-08-21 19:59:30
4 67 2019-08-13 19:40:26
5 67 2019-06-19 19:45:12
6 67 2019-04-30 20:46:03
7 67 2019-04-29 20:46:03
I am trying to group the NUMBER
column by timestamps which have less than 30 (can be any arbitrary number) day time difference
I am expecting the output to be:
NUMBER TIMESTAMP
0 67_4 2020-03-06 12:06:25
1 67_3 2019-09-20 16:21:45
2 67_3 2019-09-17 17:54:40
3 67_3 2019-08-21 19:59:30
4 67_3 2019-08-13 19:40:26
5 67_2 2019-06-19 19:45:12
6 67_1 2019-04-30 20:46:03
7 67_1 2019-04-29 20:46:03
Taking a simple diff on the timestamp like this or even ceil on the series doesn't help
#df['diff'] = df.groupby('NUMBER')['TIMESTAMP'].diff(-1).dt.days
#mask = df['diff'] > 30
#
#df.loc[mask, 'NUMBER'] = df\
# .loc[mask, 'NUMBER']\
# .astype(str) \
# + '_' \
# + df[mask]\
# .groupby('NUMBER')\
# .cumcount().add(1).astype(str)
This is the result if we take a simple diff
NUMBER TIMESTAMP diff
0 67_1 2020-03-06 12:06:25 167.0
1 67 2019-09-20 16:21:45 2.0
2 67 2019-09-17 17:54:40 26.0
3 67 2019-08-21 19:59:30 8.0
4 67_2 2019-08-13 19:40:26 54.0
5 67_3 2019-06-19 19:45:12 49.0
6 67 2019-04-30 20:46:03 1.0
7 67 2019-04-29 20:46:03 NaN
Any help will be appreciated.
Upvotes: 0
Views: 57
Reputation: 150745
Let's try comparing differences with the thresh hold and cumsum
:
thresh = 30
dt_thresh = pd.to_timedelta(thresh, unit='D')
df['NUMBER'] = df['NUMBER'].astype(str) + '_' + \
(df.TIMESTAMP.sort_values()
.groupby(df.NUMBER) # added groupby here
.diff().gt(dt_thresh)
.groupby(df.NUMBER) # and groupby here
.cumsum().add(1).astype(str)
)
Output:
NUMBER TIMESTAMP
0 67_4 2020-03-06 12:06:25
1 67_3 2019-09-20 16:21:45
2 67_3 2019-09-17 17:54:40
3 67_3 2019-08-21 19:59:30
4 67_3 2019-08-13 19:40:26
5 67_2 2019-06-19 19:45:12
6 67_1 2019-04-30 20:46:03
7 67_1 2019-04-29 20:46:03
Upvotes: 1