Reputation: 23
I want to write a function that assigns a unique ID to each group of contiguous observations, where 'contiguous' means all the observations within the group are no more than 'max_time_gap' seconds apart.
e.g.: def assign_groups(df, max_time_gap, group_col_name):
Where df is the input data as a pandas DataFrame, max_time_gap is the maximum time in seconds between timestamps, and group_col_name is the name of the destination column that will contain the assigned group IDs.
Most of the data appears at regular intervals, but there are some gaps.
| clients | connections | grants | bytes | updates | writes | timestamp |
---|---|---|---|---|---|---|---|
0 | 0 | 12333 | 322 | 23323 | 2433 | 6543 | 2019-04-18 21:06:07.260973 |
1 | 122 | 8777 | 766 | 4556 | 8755 | 766.4 | 2019-04-18 21:06:07.344215 |
2 | 655 | 75566 | 4567 | 9877.6 | 56.6 | 778 | 2019-04-18 21:06:07.436656 |
3 | 454 | 453 | 32.5 | 4553 | 344 | 3422 | 2019-04-18 21:06:07.525223 |
4 | 3222 | 113 | 342 | 5644 | 4566 | 233.3 | 2019-04-18 21:06:09.622103 |
5 | 3122 | 113 | 42 | 2644 | 4536 | 333.3 | 2019-04-18 21:06:10.622103 |
6 | 3222 | 133 | 342 | 5644 | 4516 | 133.3 | 2019-04-18 21:06:10.722103 |
...
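For testing, the sample above can be reconstructed as a small DataFrame (a minimal sketch; only the clients and timestamp columns are reproduced here, since only timestamp drives the grouping):
import pandas as pd

# Minimal reconstruction of the sample data; only the timestamp column matters for grouping
df = pd.DataFrame({
    'clients': [0, 122, 655, 454, 3222, 3122, 3222],
    'timestamp': pd.to_datetime([
        '2019-04-18 21:06:07.260973',
        '2019-04-18 21:06:07.344215',
        '2019-04-18 21:06:07.436656',
        '2019-04-18 21:06:07.525223',
        '2019-04-18 21:06:09.622103',
        '2019-04-18 21:06:10.622103',
        '2019-04-18 21:06:10.722103',
    ]),
})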
I tried some pandas groupby operations such as 'df.groupby.Datetime.diff.gt(pd.Timedelta(seconds=max_time_gap)...'. They did not work.
Thank you for your help!
Upvotes: 2
Views: 373
Reputation: 23217
You can do it this way:
1) Convert the timestamp column to datetime format, if it is not already in that format:
df['timestamp'] = pd.to_datetime(df['timestamp'])
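If you want to guard the conversion, a minimal sketch of the check (assuming pandas is imported as pd):
# Convert only if the column is not already a datetime64 dtype
if not pd.api.types.is_datetime64_any_dtype(df['timestamp']):
    df['timestamp'] = pd.to_datetime(df['timestamp'])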
2) Define the function as follows. Use pd.Grouper() for grouping by max_time_gap intervals; then take the group numbers with GroupBy.ngroup() and re-serialize these group numbers with .cumsum() whenever the group number changes:
def assign_groups(df, max_time_gap, group_col_name):
    # Bin the rows into fixed-width max_time_gap windows, anchored at the first timestamp
    df['_temp'] = df.groupby(pd.Grouper(key='timestamp', freq=max_time_gap, origin='start')).ngroup()
    # Start a new sequential group ID each time the bin number changes
    df[group_col_name] = df['_temp'].ne(df['_temp'].shift()).cumsum()
    df.drop('_temp', axis=1, inplace=True)
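As a quick illustration of the re-serialization step, with a toy Series (a minimal sketch, assuming pandas is imported as pd):
s = pd.Series([0, 0, 1, 1, 1, 3])
s.ne(s.shift()).cumsum()   # -> 1, 1, 2, 2, 2, 3: a new ID starts whenever the value changes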
max_time_gap is specified in the same format as the frequency string of pd.Grouper(), e.g. '1s' for 1 second, '500ms' for 500 milliseconds, etc. You can refer to the official documentation for more examples of this string, and also to the Offset aliases for more frequency options (e.g. 'H' for hourly frequency, 'T' for minutely frequency, etc.).
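If in doubt, a frequency string can be sanity-checked with pd.to_timedelta (a minimal sketch, assuming pandas is imported as pd):
pd.to_timedelta('1s')      # Timedelta('0 days 00:00:01')
pd.to_timedelta('500ms')   # Timedelta('0 days 00:00:00.500000')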
3) Call the function as follows:
assign_groups(df, '500ms', 'Group_ID')  # for a 500-millisecond max_time_gap
Result: a new Group_ID column at the right of the dataframe:
print(df)
clients connections grants bytes updates writes timestamp Group_ID
0 0 12333 322.0 23323.0 2433.0 6543.0 2019-04-18 21:06:07.260973 1
1 122 8777 766.0 4556.0 8755.0 766.4 2019-04-18 21:06:07.344215 1
2 655 75566 4567.0 9877.6 56.6 778.0 2019-04-18 21:06:07.436656 1
3 454 453 32.5 4553.0 344.0 3422.0 2019-04-18 21:06:07.525223 1
4 3222 113 342.0 5644.0 4566.0 233.3 2019-04-18 21:06:09.622103 2
5 3122 113 42.0 2644.0 4536.0 333.3 2019-04-18 21:06:10.622103 3
6 3222 133 342.0 5644.0 4516.0 133.3 2019-04-18 21:06:10.722103 3
If more accurate grouping is needed, we have to reset the origin / base time of each group to the first timestamp of that group. In that case, we could define the function as follows:
def assign_groups(df, max_time_gap, group_col_name):
    # Maximum allowed elapsed time per group, in nanoseconds
    max_time_gap_ns = pd.to_timedelta(max_time_gap) / pd.Timedelta(nanoseconds=1)
    last_time = df['timestamp'].iloc[0]
    gap_tot_time_ns = 0   # time elapsed since the first timestamp of the current group
    group_no = 1
    for i, t in df['timestamp'].items():   # .items() replaces the deprecated .iteritems()
        gap_tot_time_ns += (t - last_time) / pd.Timedelta(nanoseconds=1)
        if gap_tot_time_ns > max_time_gap_ns:
            # Threshold exceeded: start a new group and reset its base time
            group_no += 1
            gap_tot_time_ns = 0
        df.loc[i, group_col_name] = group_no
        last_time = t
    df[group_col_name] = df[group_col_name].astype(int)
Call the function in the same way:
assign_groups(df, '1s', 'Group_ID') # for 1 second max_time_gap
Result: a new Group_ID column at the right of the dataframe:
print(df)
clients connections grants bytes updates writes timestamp Group_ID
0 0 12333 322.0 23323.0 2433.0 6543.0 2019-04-18 21:06:07.260973 1
1 122 8777 766.0 4556.0 8755.0 766.4 2019-04-18 21:06:07.344215 1
2 655 75566 4567.0 9877.6 56.6 778.0 2019-04-18 21:06:07.436656 1
3 454 453 32.5 4553.0 344.0 3422.0 2019-04-18 21:06:07.525223 1
4 3222 113 342.0 5644.0 4566.0 233.3 2019-04-18 21:06:09.622103 2
5 3122 113 42.0 2644.0 4536.0 333.3 2019-04-18 21:06:10.622103 2
6 3222 133 342.0 5644.0 4516.0 133.3 2019-04-18 21:06:10.862103 3
Upvotes: 1