Kangaroo4you

Reputation: 23

Find all groups of contiguous timestamps and assign a unique ID to each group in a DataFrame

I want to write a function that assigns a unique ID to all groups of contiguous timestamps, where 'contiguous' means all the observations within the group are no more than 'max_time_gap' seconds apart.

e.g.: def assign_groups(df, max_time_gap, group_col_name):

Where df is the input data as a pandas DataFrame, max_time_gap is the maximum time in seconds between timestamps, and group_col_name is the destination column for the assigned group IDs.

Most of the data appears at regular intervals, but there are some gaps.

clients connections grants bytes updates writes timestamp
0 0 12333 322 23323 2433 6543 2019-04-18 21:06:07.260973
1 122 8777 766 4556 8755 766.4 2019-04-18 21:06:07.344215
2 655 75566 4567 9877.6 56.6 778 2019-04-18 21:06:07.436656
3 454 453 32.5 4553 344 3422 2019-04-18 21:06:07.525223
4 3222 113 342 5644 4566 233.3 2019-04-18 21:06:09.622103
5 3122 113 42 2644 4536 333.3 2019-04-18 21:06:10.622103
6 3222 133 342 5644 4516 133.3 2019-04-18 21:06:10.722103

...
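For reproducibility, the sample above can be rebuilt with something like this (values copied from the printout):

import pandas as pd

df = pd.DataFrame({
    'clients':     [0, 122, 655, 454, 3222, 3122, 3222],
    'connections': [12333, 8777, 75566, 453, 113, 113, 133],
    'grants':      [322, 766, 4567, 32.5, 342, 42, 342],
    'bytes':       [23323, 4556, 9877.6, 4553, 5644, 2644, 5644],
    'updates':     [2433, 8755, 56.6, 344, 4566, 4536, 4516],
    'writes':      [6543, 766.4, 778, 3422, 233.3, 333.3, 133.3],
    'timestamp':   pd.to_datetime(['2019-04-18 21:06:07.260973',
                                   '2019-04-18 21:06:07.344215',
                                   '2019-04-18 21:06:07.436656',
                                   '2019-04-18 21:06:07.525223',
                                   '2019-04-18 21:06:09.622103',
                                   '2019-04-18 21:06:10.622103',
                                   '2019-04-18 21:06:10.722103']),
})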

I tried some pandas groupby operations, such as 'df.groupby.Datetime.diff.gt(pd.Timedelta(seconds=max_time_gap)...', but they did not work.

Thank you for your help!

Upvotes: 2

Views: 373

Answers (1)

SeaBean

Reputation: 23217

You can do it this way:

1) Convert the timestamp column to datetime format, if it is not already:

df['timestamp'] = pd.to_datetime(df['timestamp'])

2) Define the function as follows:

Use pd.Grouper() to group by max_time_gap intervals, take the group numbers with GroupBy.ngroup(), and then renumber the groups with .cumsum() wherever the group number changes.

def assign_groups(df, max_time_gap, group_col_name):
    # Bin rows into fixed max_time_gap intervals, anchored at the first timestamp
    df['_temp'] = df.groupby(pd.Grouper(key='timestamp', freq=max_time_gap, origin='start')).ngroup()
    # Renumber consecutively: start a new ID whenever the bin number changes
    df[group_col_name] = df['_temp'].ne(df['_temp'].shift()).cumsum()
    df.drop('_temp', axis=1, inplace=True)

max_time_gap is defined in the same format as the frequency string of pd.Grouper(), e.g. '1s' for 1 second or '500ms' for 500 milliseconds. You can refer to the official documentation for more examples of this string, and to the Offset aliases for more frequency options (e.g. 'H' for hourly frequency, 'T' for minutely frequency, etc.)
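If in doubt about a frequency string, a quick sanity check (a convenience, not required by the solution) is to parse it with pd.to_timedelta() and inspect the resulting Timedelta:

import pandas as pd

# Each string parses to the Timedelta it denotes
print(pd.to_timedelta('1s'))       # 0 days 00:00:01
print(pd.to_timedelta('500ms'))    # 0 days 00:00:00.500000
print(pd.to_timedelta('2min'))     # 0 days 00:02:00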

3) Call the function as follows:

assign_groups(df, '500ms', 'Group_ID')     # for a 500-millisecond max_time_gap

Result:

A new column Group_ID appears at the right of the DataFrame.

print(df)

   clients  connections  grants    bytes  updates  writes                  timestamp  Group_ID
0        0        12333   322.0  23323.0   2433.0  6543.0 2019-04-18 21:06:07.260973         1
1      122         8777   766.0   4556.0   8755.0   766.4 2019-04-18 21:06:07.344215         1
2      655        75566  4567.0   9877.6     56.6   778.0 2019-04-18 21:06:07.436656         1
3      454          453    32.5   4553.0    344.0  3422.0 2019-04-18 21:06:07.525223         1
4     3222          113   342.0   5644.0   4566.0   233.3 2019-04-18 21:06:09.622103         2
5     3122          113    42.0   2644.0   4536.0   333.3 2019-04-18 21:06:10.622103         3
6     3222          133   342.0   5644.0   4516.0   133.3 2019-04-18 21:06:10.722103         3

Edit

If more accurate grouping is needed, we have to reset the origin / base time of each group to the first timestamp of that group. To that end, we could define the function as follows:

def assign_groups(df, max_time_gap, group_col_name):
    # Convert the allowed gap to a plain number of nanoseconds
    max_time_gap_ns = pd.to_timedelta(max_time_gap) / pd.Timedelta(nanoseconds=1)

    last_time = df['timestamp'].iloc[0]
    gap_tot_time_ns = 0          # time elapsed since the first timestamp of the current group
    group_no = 1

    for i, t in df['timestamp'].items():     # .iteritems() is deprecated; use .items()
        gap_tot_time_ns += (t - last_time) / pd.Timedelta(nanoseconds=1)
        if gap_tot_time_ns > max_time_gap_ns:
            # Start a new group and re-anchor the elapsed-time counter
            group_no += 1
            gap_tot_time_ns = 0
        df.loc[i, group_col_name] = group_no
        last_time = t

    df[group_col_name] = df[group_col_name].astype(int)

Call the function in the same way:

assign_groups(df, '1s', 'Group_ID')     # for a 1-second max_time_gap

Result:

A new column Group_ID appears at the right of the DataFrame.

print(df)

   clients  connections  grants    bytes  updates  writes                  timestamp  Group_ID
0        0        12333   322.0  23323.0   2433.0  6543.0 2019-04-18 21:06:07.260973         1
1      122         8777   766.0   4556.0   8755.0   766.4 2019-04-18 21:06:07.344215         1
2      655        75566  4567.0   9877.6     56.6   778.0 2019-04-18 21:06:07.436656         1
3      454          453    32.5   4553.0    344.0  3422.0 2019-04-18 21:06:07.525223         1
4     3222          113   342.0   5644.0   4566.0   233.3 2019-04-18 21:06:09.622103         2
5     3122          113    42.0   2644.0   4536.0   333.3 2019-04-18 21:06:10.622103         2
6     3222          133   342.0   5644.0   4516.0   133.3 2019-04-18 21:06:10.862103         3
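As an aside: if 'contiguous' is read strictly as each row being no more than max_time_gap from the previous row (rather than from the first row of the group), a vectorized diff/cumsum sketch avoids the Python-level loop entirely (assign_groups_by_gap is just an illustrative name):

def assign_groups_by_gap(df, max_time_gap, group_col_name):
    # A new group starts wherever the gap to the previous row exceeds max_time_gap;
    # the first row's diff is NaT, which compares as False, so it stays in group 1
    new_group = df['timestamp'].diff() > pd.to_timedelta(max_time_gap)
    df[group_col_name] = new_group.cumsum() + 1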

Upvotes: 1
