Separating non-overlapping intervals within groups and counting in R

Question

Using R, I have inpatient data that I have grouped by DNA strain (of the pathogen), clinic of inpatient stay, and overlapping timeframe of the stay to determine if transmission is possible.

I need to sequentially number the overlapping groups. This would appear quite simple, but two issues:

Everything I have found on SO or elsewhere talks about numbering rows within groups. I need each row in a group the same number and the groups themselves to be counted.
Whatever approach would accomplish that initially seemed simple enough with a %>% group_by(strain, clinic) %>%, but this doesn't account for non-overlapping time intervals

I have tried several approaches and search before finally giving in and posting here (none of my attempts are worthy of event posting here to waste your time.) The below code is an example of the data I have (have) and data I want (want). Note for strain B, all patients are in Clinic_1 but there are two groups due to a separation in time intervals.

Any advice would be much appreciated.

have <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
                   strain=c(rep("A",4),rep("B",5)),
                   clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
                   datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/21","2020/03/06","2020/03/18")),
                   dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/22","2020/03/21","2020/03/22"))
                   )

want <- data.frame(have,overlap_number=c(1,1,2,2,3,3,3,4,4))

#How the final data would look
> View(want)
   id strain   clinic     datein    dateout overlap_number
1 K01      A Clinic_1 2020-01-01 2020-01-05              1
2 K02      A Clinic_1 2020-01-03 2020-01-16              1
3 K03      A Clinic_2 2020-02-03 2020-02-09              2
4 K04      A Clinic_2 2020-02-09 2020-02-19              2
5 K05      B Clinic_1 2020-02-18 2020-02-27              3
6 K06      B Clinic_1 2020-02-20 2020-02-23              3
7 K07      B Clinic_1 2020-02-21 2020-02-22              3
8 K08      B Clinic_1 2020-03-06 2020-03-21              4
9 K09      B Clinic_1 2020-03-18 2020-03-22              4

An alternative dataset based on Akrun's comment, changing dates slightly for K07:

have2 <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
                   strain=c(rep("A",4),rep("B",5)),
                   clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
                   datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/25","2020/03/06","2020/03/18")),
                   dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/29","2020/03/21","2020/03/22"))
                   )

#Output:

#> have2 %>% 
#+   mutate(overlap_number = rleid(strain, clinic, 
#+                                 cumsum(datein > lag(dateout, default = #first(dateout)))))
#   id strain   clinic     datein    dateout overlap_number
#1 K01      A Clinic_1 2020-01-01 2020-01-05              1
#2 K02      A Clinic_1 2020-01-03 2020-01-16              1
#3 K03      A Clinic_2 2020-02-03 2020-02-09              2
#4 K04      A Clinic_2 2020-02-09 2020-02-19              2
#5 K05      B Clinic_1 2020-02-18 2020-02-27              3
#6 K06      B Clinic_1 2020-02-20 2020-02-23              3
#7 K07      B Clinic_1 2020-02-25 2020-02-29              4 ## treats this as single, should be 3
#8 K08      B Clinic_1 2020-03-06 2020-03-21              5 ## should be 4
#9 K09      B Clinic_1 2020-03-18 2020-03-22              5 ## should be 4

chinsoon12 · Accepted Answer

An option using data.table:

setkey(setDT(have), clinic, strain, datein, dateout)
have[, g := cumsum(c(0L, (shift(datein, -1L) > cummax(as.integer(dateout)))[-.N])), 
  .(clinic, strain)][, 
    g := rleid(clinic, strain, g)]

Also:

have[, g02 := cumsum(datein > shift(cummax(as.integer(dateout)), fill=dateout[1L])),
    .(clinic, strain)][,
        g2 := rleid(clinic, strain, g02)]

output:

    id strain   clinic     datein    dateout g g2
1: K01      A Clinic_1 2020-01-01 2020-01-05 1  1
2: K02      A Clinic_1 2020-01-03 2020-01-16 1  1
3: K05      B Clinic_1 2020-02-18 2020-02-27 2  2
4: K06      B Clinic_1 2020-02-20 2020-02-23 2  2
5: K07      B Clinic_1 2020-02-21 2020-02-22 2  2
6: K08      B Clinic_1 2020-03-06 2020-03-21 3  3
7: K09      B Clinic_1 2020-03-18 2020-03-22 3  3
8: K03      A Clinic_2 2020-02-03 2020-02-09 4  4
9: K04      A Clinic_2 2020-02-09 2020-02-19 4  4

Idea on the cummax came from David Aurenburg post: How to flatten / merge overlapping time periods

Separating non-overlapping intervals within groups and counting in R

Answers (2)

Related Questions