dloudtrain
dloudtrain

Reputation: 301

Assign unique id for each group based on time difference between rows in R

I have a dataframe [df] like this:

marker_id   timestamp
A           2020-10-01 07:32:14
A           2020-10-01 07:34:09
A           2020-10-01 10:13:00
A           2020-10-05 06:55:04
A           2020-10-05 06:59:14
A           2020-10-09 17:44:02
B           2020-10-01 13:58:47
B           2020-10-02 04:11:38
B           2020-10-02 04:15:07

I would like to know whether there is a way of assigning groups to individual timestamp sequences for each marker_id. The sequence should contain rows where the difference between 2 rows in less than 1 hour.

The desired result would look like this:

marker_id   timestamp              group_id
A           2020-10-01 07:32:14    1
A           2020-10-01 07:34:09    1
A           2020-10-01 10:13:00    2
A           2020-10-05 06:55:04    3
A           2020-10-05 06:59:14    3
A           2020-10-09 17:44:02    4
B           2020-10-01 13:58:47    5
B           2020-10-02 04:11:38    6
B           2020-10-02 04:15:07    6

Is there any way of doing this? Thank you for any suggestions.

Upvotes: 1

Views: 173

Answers (1)

Paul
Paul

Reputation: 9087

Here is some dummy data.

library(tidyverse)

df <- tribble(
~marker_id,   ~timestamp,
'A',           '2020-10-01 07:32:14',
'A',           '2020-10-01 07:34:09',
'A',           '2020-10-01 10:13:00',
'A',           '2020-10-05 06:55:04',
'A',           '2020-10-05 06:59:14',
'A',           '2020-10-09 17:44:02',
'B',           '2020-10-01 13:58:47',
'B',           '2020-10-02 04:11:38',
'B',           '2020-10-02 04:15:07'
) %>%
  mutate(timestamp = lubridate::as_datetime(timestamp))

Use lag to get the difference beween rows and convert that to hours. group_id can be computed by checking if it is >= 1 and getting the cumsum.

df %>%
  mutate(
    group_id = cumsum(
      abs(as.numeric(timestamp - lag(timestamp, default = min(timestamp)), units = "hours")) >= 1
    ) + 1
  )
#> # A tibble: 9 x 3
#>   marker_id timestamp           group_id
#>   <chr>     <dttm>                 <dbl>
#> 1 A         2020-10-01 07:32:14        1
#> 2 A         2020-10-01 07:34:09        1
#> 3 A         2020-10-01 10:13:00        2
#> 4 A         2020-10-05 06:55:04        3
#> 5 A         2020-10-05 06:59:14        3
#> 6 A         2020-10-09 17:44:02        4
#> 7 B         2020-10-01 13:58:47        5
#> 8 B         2020-10-02 04:11:38        6
#> 9 B         2020-10-02 04:15:07        6

Upvotes: 2

Related Questions