jpsmith
jpsmith

Reputation: 17175

Identifying if time intervals overlap with multiple patient IDs, in r

I have admission and discharge data from several clinics and need to determine if there is overlap in admission between 2+ patients at the same clinic with the same DNA strain of the pathogen. Working in R.

Only 5 variables: study ID, date in, date out, DNA strain, and clinic. Each row is a separate visit, meaning the "ID" variable can repeat. I need to determine how many "overlaps" there were, meaning patients with the same pathogen DNA were at the same clinic at the same time.

This is a (seemingly) simple question and much discussed on this site. I am able to identify overlaps when each ID is unique. My specific problem is how to deal with the repeating IDs. In the below code, I provide sample data and my code that works when each ID is unique (you can test this quickly by changing seq(1:20)),each=2) to seq(1:40)),each=1) in the first line of the code). If I run this code with repeating IDs, it flags all visits as overlap=1 for an ID with any overlaps, regardless if it was the actual the visit that overlapped.

I know that overlapping date intervals is a topic of much discussion, so please trust that I have thoroughly looked at everything I could find, and almost made it except for this one last part. Any advice is much appreciated.

Thank you!

#Set globals
set.seed(8126) 
library(lubridate); library(data.table)

#Example Data
have<-data.frame(rep(paste0("k",seq(1:20)),each=2),sample(seq(as.Date('2015/01/01'), as.Date('2020/01/01'), by="day"), 40))
names(have)<-c("id","datein")
have$dateout<-have$datein+40 #arbitrarily add 40 days to admission date
have$dnastrain<-as.vector(replicate(20,rep(sample(c("Type 1","Type 2","Type 3"),1),each=2)))
have$clinic<-sample(c("A","B","C","D"),40, replace=TRUE)


#Code that works if each ID is unique
setDT(have); setkey(have,datein, dateout) #Convert to DT and set date in/out as keys
overlaps<-unique(foverlaps(have, have)[id!=i.id & dnastrain==i.dnastrain & clinic==i.clinic, id]) #find overlaps

have[,`:=`(overlap=0)][id %in% overlaps, overlap:=1][order(datein)] #Identify overlaps

Upvotes: 3

Views: 325

Answers (4)

captcoma
captcoma

Reputation: 1898

Here a solution I've just learned today using dplyr and purrr

library(dplyr)
library(purrr)
library(lubridate)


have %>% 
  mutate(interval=interval(datein,dateout)) %>%
  group_by(dnastrain,clinic) %>% 
  mutate(overlap = purrr::map_int(row_number(),
  ~+any(datein[.x] %within% interval[-.x] | dateout[.x] %within% interval[-.x])))

Upvotes: 1

chinsoon12
chinsoon12

Reputation: 25225

You can use a join instead of %in%:

setDT(have, key=c("datein","dateout"))
overlaps <- unique(foverlaps(have, have)[clinic==i.clinic, dnastrain==i.dnastrain & id!=i.id])

cols <- copy(names(have))
have[, overlap := 0][
    overlaps, on=cols, overlap := 1][
        order(datein)] 

output:

     id     datein    dateout dnastrain clinic overlap
 1:  k6 2015-01-01 2015-02-10    Type 1      D       0
 2:  k6 2015-01-20 2015-03-01    Type 1      D       0
 3:  k9 2015-04-09 2015-05-19    Type 2      B       1
 4: k11 2015-04-12 2015-05-22    Type 2      B       1
 5: k10 2015-04-19 2015-05-29    Type 3      C       0
 6: k10 2015-07-03 2015-08-12    Type 3      A       0
 7:  k4 2015-08-06 2015-09-15    Type 1      C       0
 8: k16 2015-08-26 2015-10-05    Type 2      A       0
 9: k13 2016-05-20 2016-06-29    Type 3      D       1
10: k19 2016-05-26 2016-07-05    Type 3      D       1
11: k15 2016-06-23 2016-08-02    Type 3      A       0
12:  k7 2016-06-29 2016-08-08    Type 3      B       0
13: k18 2016-07-18 2016-08-27    Type 2      B       0
14: k13 2016-08-11 2016-09-20    Type 3      A       0
15:  k1 2016-09-24 2016-11-03    Type 3      C       0
16: k18 2016-09-29 2016-11-08    Type 2      B       0
17: k12 2017-03-13 2017-04-22    Type 3      B       0
18:  k7 2017-03-24 2017-05-03    Type 3      C       0
19: k14 2017-05-28 2017-07-07    Type 3      B       0
20:  k3 2017-06-05 2017-07-15    Type 2      B       0
21: k17 2017-07-06 2017-08-15    Type 3      A       0
22: k17 2017-09-08 2017-10-18    Type 3      B       1
23:  k2 2017-09-30 2017-11-09    Type 3      B       1
24: k15 2017-10-07 2017-11-16    Type 3      D       0
25:  k8 2018-01-01 2018-02-10    Type 1      C       0
26:  k5 2018-07-07 2018-08-16    Type 2      B       0
27: k20 2018-07-18 2018-08-27    Type 1      A       0
28:  k5 2018-08-30 2018-10-09    Type 2      B       0
29: k16 2018-09-22 2018-11-01    Type 2      D       0
30: k14 2018-09-24 2018-11-03    Type 3      B       0
31:  k2 2018-11-02 2018-12-12    Type 3      A       1
32: k12 2018-11-13 2018-12-23    Type 3      A       1
33: k19 2018-11-21 2018-12-31    Type 3      D       0
34:  k3 2018-11-30 2019-01-09    Type 2      A       0
35:  k1 2018-12-14 2019-01-23    Type 3      A       1
36:  k9 2019-03-13 2019-04-22    Type 2      D       0
37:  k8 2019-07-18 2019-08-27    Type 1      C       1
38: k20 2019-08-10 2019-09-19    Type 1      C       1
39: k11 2019-08-30 2019-10-09    Type 2      B       0
40:  k4 2019-09-04 2019-10-14    Type 1      B       0
     id     datein    dateout dnastrain clinic overlap

data:

set.seed(8126) 
library(data.table)

have<-data.frame(rep(paste0("k",seq(1:20)),each=2),sample(seq(as.Date('2015/01/01'), as.Date('2020/01/01'), by="day"), 40))
names(have)<-c("id","datein")
have$dateout<-have$datein+40 #arbitrarily add 40 days to admission date
have$dnastrain<-as.vector(replicate(20,rep(sample(c("Type 1","Type 2","Type 3"),1),each=2)))
have$clinic<-sample(c("A","B","C","D"),40, replace=TRUE)

Upvotes: 2

Patrick
Patrick

Reputation: 326

This is a dplyr method.


have$rownum <- 1:nrow(have)

 crossover <- have %>% 
   inner_join(have, by = c("clinic", "dnastrain"), suffix=(c(".a",".b"))) %>% 
   filter(datein.a<=dateout.b & 
          dateout.a>=datein.b &
          id.a!=id.b & 
          rownum.a < rownum.b
          )


crossover
  id.a   datein.a  dateout.a dnastrain clinic rownum.a id.b   datein.b  dateout.b rownum.b
1   k1 2017-11-02 2017-12-12    Type 3      B        1  k18 2017-10-03 2017-11-12       35
2  k10 2016-02-15 2016-03-26    Type 3      A       19  k13 2016-03-22 2016-05-01       26
3  k17 2017-08-06 2017-09-15    Type 2      C       33  k20 2017-09-02 2017-10-12       39

Upvotes: 4

Adam B.
Adam B.

Reputation: 1180

I figured out a way to do it a bit differently using purrr:


participant_in <- have$datein
participant_out <- have$dateout

# For each participant, create a vector of days at which they were in a clinic
days <- map2(participant_in, participant_out, ~ seq(from = ymd(.x), to = ymd(.y), by = 'days'))
ids <- map2(have$id, days, ~ rep(.x, length(.y)))

days <- days %>% reduce(c)
ids <- ids %>% unlist()

participant_tib <- tibble(id = ids, day = days) %>%
  left_join(have, by = 'id') %>%
  select(-datein, -dateout) 

participant_tib %>%
  group_by(day, dnastrain, clinic) %>%
  count() %>%
  arrange(desc(n))

With the ^above, you can check on which days there were several participants with the same strain in the same clinic. From there, you could filter for days in which n > 1, and check the participant id's for those.

Upvotes: 3

Related Questions