Kangaroo4you

Reputation: 23

Find all groups of contiguous timestamps and assign a unique ID to each group in a DataFrame

I want to write a function that assigns a unique ID to all groups of contiguous timestamps, where 'contiguous' means all the observations within the group are no more than 'max_time_gap' seconds apart.

e.g.: def assign_groups(df, max_time_gap, group_col_name):

Where df is the input data as a pandas DataFrame, max_time_gap is the maximum time in seconds between timestamps, and group_col_name is the destination column for the assigned group IDs.

Most of the data appears at regular intervals, but there are some gaps.

clients connections grants bytes updates writes timestamp
0 0 12333 322 23323 2433 6543 2019-04-18 21:06:07.260973
1 122 8777 766 4556 8755 766.4 2019-04-18 21:06:07.344215
2 655 75566 4567 9877.6 56.6 778 2019-04-18 21:06:07.436656
3 454 453 32.5 4553 344 3422 2019-04-18 21:06:07.525223
4 3222 113 342 5644 4566 233.3 2019-04-18 21:06:09.622103
5 3122 113 42 2644 4536 333.3 2019-04-18 21:06:10.622103
6 3222 133 342 5644 4516 133.3 2019-04-18 21:06:10.722103

...
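For reproducibility, the sample above can be rebuilt with something like this (values copied from the printout):

import pandas as pd

df = pd.DataFrame({
    'clients':     [0, 122, 655, 454, 3222, 3122, 3222],
    'connections': [12333, 8777, 75566, 453, 113, 113, 133],
    'grants':      [322, 766, 4567, 32.5, 342, 42, 342],
    'bytes':       [23323, 4556, 9877.6, 4553, 5644, 2644, 5644],
    'updates':     [2433, 8755, 56.6, 344, 4566, 4536, 4516],
    'writes':      [6543, 766.4, 778, 3422, 233.3, 333.3, 133.3],
    'timestamp':   pd.to_datetime(['2019-04-18 21:06:07.260973',
                                   '2019-04-18 21:06:07.344215',
                                   '2019-04-18 21:06:07.436656',
                                   '2019-04-18 21:06:07.525223',
                                   '2019-04-18 21:06:09.622103',
                                   '2019-04-18 21:06:10.622103',
                                   '2019-04-18 21:06:10.722103']),
})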

I tried some pandas groupby operations, such as 'df.groupby.Datetime.diff.gt(pd.Timedelta(seconds=max_time_gap)...', but they did not work.

Thank you for your help!

Upvotes: 2

Views: 373

Answers (1)

SeaBean

Reputation: 23217

You can do it this way:

1) Convert the timestamp column to datetime format, if it is not already:

df['timestamp'] = pd.to_datetime(df['timestamp'])

2) Define the function as follows:

Use pd.Grouper() to group by max_time_gap intervals, take the group numbers with GroupBy.ngroup(), and then renumber the groups with .cumsum() wherever the group number changes.

def assign_groups(df, max_time_gap, group_col_name):
    # Bin rows into fixed max_time_gap intervals, anchored at the first timestamp
    df['_temp'] = df.groupby(pd.Grouper(key='timestamp', freq=max_time_gap, origin='start')).ngroup()
    # Renumber consecutively: start a new ID whenever the bin number changes
    df[group_col_name] = df['_temp'].ne(df['_temp'].shift()).cumsum()
    df.drop('_temp', axis=1, inplace=True)

max_time_gap is defined in the same format as the frequency string of pd.Grouper(), e.g. '1s' for 1 second or '500ms' for 500 milliseconds. You can refer to the official documentation for more examples of this string, and to the Offset aliases for more frequency options (e.g. 'H' for hourly frequency, 'T' for minutely frequency, etc.)
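If in doubt about a frequency string, a quick sanity check (a convenience, not required by the solution) is to parse it with pd.to_timedelta() and inspect the resulting Timedelta:

import pandas as pd

# Each string parses to the Timedelta it denotes
print(pd.to_timedelta('1s'))       # 0 days 00:00:01
print(pd.to_timedelta('500ms'))    # 0 days 00:00:00.500000
print(pd.to_timedelta('2min'))     # 0 days 00:02:00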

3) Call the function as follows:

assign_groups(df, '500ms', 'Group_ID')     # for a 500-millisecond max_time_gap

Result:

A new column Group_ID appears at the right of the DataFrame.

print(df)

   clients  connections  grants    bytes  updates  writes                  timestamp  Group_ID
0        0        12333   322.0  23323.0   2433.0  6543.0 2019-04-18 21:06:07.260973         1
1      122         8777   766.0   4556.0   8755.0   766.4 2019-04-18 21:06:07.344215         1
2      655        75566  4567.0   9877.6     56.6   778.0 2019-04-18 21:06:07.436656         1
3      454          453    32.5   4553.0    344.0  3422.0 2019-04-18 21:06:07.525223         1
4     3222          113   342.0   5644.0   4566.0   233.3 2019-04-18 21:06:09.622103         2
5     3122          113    42.0   2644.0   4536.0   333.3 2019-04-18 21:06:10.622103         3
6     3222          133   342.0   5644.0   4516.0   133.3 2019-04-18 21:06:10.722103         3

Edit

If more accurate grouping is needed, we have to reset the origin / base time of each group to the first timestamp of that group. To that end, we could define the function as follows:

def assign_groups(df, max_time_gap, group_col_name):
    # Convert the allowed gap to a plain number of nanoseconds
    max_time_gap_ns = pd.to_timedelta(max_time_gap) / pd.Timedelta(nanoseconds=1)

    last_time = df['timestamp'].iloc[0]
    gap_tot_time_ns = 0          # time elapsed since the first timestamp of the current group
    group_no = 1

    for i, t in df['timestamp'].items():     # .iteritems() is deprecated; use .items()
        gap_tot_time_ns += (t - last_time) / pd.Timedelta(nanoseconds=1)
        if gap_tot_time_ns > max_time_gap_ns:
            # Start a new group and re-anchor the elapsed-time counter
            group_no += 1
            gap_tot_time_ns = 0
        df.loc[i, group_col_name] = group_no
        last_time = t

    df[group_col_name] = df[group_col_name].astype(int)

Call the function in the same way:

assign_groups(df, '1s', 'Group_ID')     # for a 1-second max_time_gap

Result:

A new column Group_ID appears at the right of the DataFrame.

print(df)

   clients  connections  grants    bytes  updates  writes                  timestamp  Group_ID
0        0        12333   322.0  23323.0   2433.0  6543.0 2019-04-18 21:06:07.260973         1
1      122         8777   766.0   4556.0   8755.0   766.4 2019-04-18 21:06:07.344215         1
2      655        75566  4567.0   9877.6     56.6   778.0 2019-04-18 21:06:07.436656         1
3      454          453    32.5   4553.0    344.0  3422.0 2019-04-18 21:06:07.525223         1
4     3222          113   342.0   5644.0   4566.0   233.3 2019-04-18 21:06:09.622103         2
5     3122          113    42.0   2644.0   4536.0   333.3 2019-04-18 21:06:10.622103         2
6     3222          133   342.0   5644.0   4516.0   133.3 2019-04-18 21:06:10.862103         3
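As an aside: if 'contiguous' is read strictly as each row being no more than max_time_gap from the previous row (rather than from the first row of the group), a vectorized diff/cumsum sketch avoids the Python-level loop entirely (assign_groups_by_gap is just an illustrative name):

def assign_groups_by_gap(df, max_time_gap, group_col_name):
    # A new group starts wherever the gap to the previous row exceeds max_time_gap;
    # the first row's diff is NaT, which compares as False, so it stays in group 1
    new_group = df['timestamp'].diff() > pd.to_timedelta(max_time_gap)
    df[group_col_name] = new_group.cumsum() + 1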

Upvotes: 1
