Reputation: 17731
Using R, I have inpatient data that I have grouped by DNA strain (of the pathogen), clinic of inpatient stay, and overlapping timeframe of the stay to determine if transmission is possible.
I need to sequentially number the overlapping groups. This would appear quite simple, but two issues:
%>% group_by(strain, clinic) %>%
, but this doesn't account for non-overlapping time intervals I have tried several approaches and search before finally giving in and posting here (none of my attempts are worthy of event posting here to waste your time.) The below code is an example of the data I have (have
) and data I want (want
). Note for strain B
, all patients are in Clinic_1
but there are two groups due to a separation in time intervals.
Any advice would be much appreciated.
have <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
strain=c(rep("A",4),rep("B",5)),
clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/21","2020/03/06","2020/03/18")),
dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/22","2020/03/21","2020/03/22"))
)
want <- data.frame(have,overlap_number=c(1,1,2,2,3,3,3,4,4))
#How the final data would look
> View(want)
id strain clinic datein dateout overlap_number
1 K01 A Clinic_1 2020-01-01 2020-01-05 1
2 K02 A Clinic_1 2020-01-03 2020-01-16 1
3 K03 A Clinic_2 2020-02-03 2020-02-09 2
4 K04 A Clinic_2 2020-02-09 2020-02-19 2
5 K05 B Clinic_1 2020-02-18 2020-02-27 3
6 K06 B Clinic_1 2020-02-20 2020-02-23 3
7 K07 B Clinic_1 2020-02-21 2020-02-22 3
8 K08 B Clinic_1 2020-03-06 2020-03-21 4
9 K09 B Clinic_1 2020-03-18 2020-03-22 4
An alternative dataset based on Akrun's comment, changing dates slightly for K07
:
have2 <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
strain=c(rep("A",4),rep("B",5)),
clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/25","2020/03/06","2020/03/18")),
dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/29","2020/03/21","2020/03/22"))
)
#Output:
#> have2 %>%
#+ mutate(overlap_number = rleid(strain, clinic,
#+ cumsum(datein > lag(dateout, default = #first(dateout)))))
# id strain clinic datein dateout overlap_number
#1 K01 A Clinic_1 2020-01-01 2020-01-05 1
#2 K02 A Clinic_1 2020-01-03 2020-01-16 1
#3 K03 A Clinic_2 2020-02-03 2020-02-09 2
#4 K04 A Clinic_2 2020-02-09 2020-02-19 2
#5 K05 B Clinic_1 2020-02-18 2020-02-27 3
#6 K06 B Clinic_1 2020-02-20 2020-02-23 3
#7 K07 B Clinic_1 2020-02-25 2020-02-29 4 ## treats this as single, should be 3
#8 K08 B Clinic_1 2020-03-06 2020-03-21 5 ## should be 4
#9 K09 B Clinic_1 2020-03-18 2020-03-22 5 ## should be 4
Upvotes: 2
Views: 156
Reputation: 25223
An option using data.table
:
setkey(setDT(have), clinic, strain, datein, dateout)
have[, g := cumsum(c(0L, (shift(datein, -1L) > cummax(as.integer(dateout)))[-.N])),
.(clinic, strain)][,
g := rleid(clinic, strain, g)]
Also:
have[, g02 := cumsum(datein > shift(cummax(as.integer(dateout)), fill=dateout[1L])),
.(clinic, strain)][,
g2 := rleid(clinic, strain, g02)]
output:
id strain clinic datein dateout g g2
1: K01 A Clinic_1 2020-01-01 2020-01-05 1 1
2: K02 A Clinic_1 2020-01-03 2020-01-16 1 1
3: K05 B Clinic_1 2020-02-18 2020-02-27 2 2
4: K06 B Clinic_1 2020-02-20 2020-02-23 2 2
5: K07 B Clinic_1 2020-02-21 2020-02-22 2 2
6: K08 B Clinic_1 2020-03-06 2020-03-21 3 3
7: K09 B Clinic_1 2020-03-18 2020-03-22 3 3
8: K03 A Clinic_2 2020-02-03 2020-02-09 4 4
9: K04 A Clinic_2 2020-02-09 2020-02-19 4 4
Idea on the cummax
came from David Aurenburg post: How to flatten / merge overlapping time periods
Upvotes: 1
Reputation: 887851
We can use rleid
library(dplyr)
library(data.table)
have %>%
mutate(overlap_number = rleid(strain, clinic,
cumsum(datein > lag(dateout, default = first(dateout)))))
# id strain clinic datein dateout overlap_number
#1 K01 A Clinic_1 2020-01-01 2020-01-05 1
#2 K02 A Clinic_1 2020-01-03 2020-01-16 1
#3 K03 A Clinic_2 2020-02-03 2020-02-09 2
#4 K04 A Clinic_2 2020-02-09 2020-02-19 2
#5 K05 B Clinic_1 2020-02-18 2020-02-27 3
#6 K06 B Clinic_1 2020-02-20 2020-02-23 3
#7 K07 B Clinic_1 2020-02-21 2020-02-22 3
#8 K08 B Clinic_1 2020-03-06 2020-03-21 4
#9 K09 B Clinic_1 2020-03-18 2020-03-22 4
Upvotes: 1