Reputation:

Create column that groups a sequence of consecutive dates with ID in R

I have been trying to find a solution to this and am unsure if it would require multiple steps using dplyr and/or creating some kind of function. Essentially, I want to create an "event" column with variables that link a sequence/repeat dates and ID together and indicate the event order.

before:

event date        id         
2020-12-25        111       
2020-12-26        111        
2020-12-27        111        
2020-12-30        222        
2020-12-30        222        
2021-01-30        333        
2021-01-31        333        
2021-05-25        222        
2021-05-26        222  
2021-07-08        222
2021-07-09        222

after:

event date        id         event
2020-12-25        111        1
2020-12-26        111        1
2020-12-27        111        1
2020-12-30        222        1
2020-12-30        222        1
2021-01-30        333        1
2021-01-31        333        1
2021-05-25        222        2
2021-05-26        222        2
2021-07-08        222        3
2021-07-09        222        3

The process to assign event values is essentially if the id is linked to a sequence of dates that are consecutive, then this can be grouped into an event. If the id is linked to multiple consecutive dates/individual dates, then that is a new and separate event associated with that id.

edit: added more data points to provide more info

Upvotes: 2

Answers (2)

r2evans

Reputation: 160447

We can capitalize on data.table::rleid for this.

(I converted your event_date to a Date-class before running these.)

data.table

library(data.table)
as.data.table(dat)[, r1 := rleid(id, cumsum(c(FALSE, diff(event_date) > 1)))
  ][, event2 := match(r1, unique(r1)), by = id
  ][, r1 := NULL ]
#    event_date  id event event2
# 1: 2020-12-25 111     1      1
# 2: 2020-12-26 111     1      1
# 3: 2020-12-27 111     1      1
# 4: 2020-12-30 222     1      1
# 5: 2020-12-30 222     1      1
# 6: 2021-01-30 333     1      1
# 7: 2021-01-31 333     1      1
# 8: 2021-05-25 222     2      2
# 9: 2021-05-26 222     2      2

For non-data.table applications where you don't have it installed, you can use

my_rleid <- function(...) {
  r <- rle(do.call(paste, c(list(...), sep = "_")))$lengths
  rep(seq_along(r), times = r)
}

dplyr

library(dplyr)
dat %>%
  mutate(r1 = my_rleid(id, cumsum(c(FALSE, diff(event_date) > 1)))) %>%
  group_by(id) %>%
  mutate(event2 = match(r1, unique(r1))) %>%
  ungroup() %>%
  select(-r1)

base R

r1 <- my_rleid(dat$id, cumsum(c(FALSE, diff(dat$event_date) > 1)))
dat$event2 <- ave(r1, dat$id, FUN = function(z) match(z, unique(z)))
dat$r1 <- NULL

Data

dat <- structure(list(event_date = structure(c(18621, 18622, 18623, 18626, 18626, 18657, 18658, 18772, 18773), class = "Date"), id = c(111L, 111L, 111L, 222L, 222L, 333L, 333L, 222L, 222L), event = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L)), row.names = c(NA, -9L), class = "data.frame")

Upvotes: 1

Ronak Shah

Reputation: 388982

You can do this with the help of diff and cumsum -

library(dplyr)

df <- df %>%
  mutate(event_date = as.Date(event_date)) %>%
  group_by(id) %>%
  mutate(event = cumsum(c(TRUE, diff(event_date) > 1))) %>%
  ungroup

df

#  event_date    id event
#  <date>     <int> <int>
#1 2020-12-25   111     1
#2 2020-12-26   111     1
#3 2020-12-27   111     1
#4 2020-12-30   222     1
#5 2020-12-30   222     1
#6 2021-01-30   333     1
#7 2021-01-31   333     1
#8 2021-05-25   222     2
#9 2021-05-26   222     2

For each id, event value is incremented when the difference between consecutive dates is greater than 1.