HNSKD
HNSKD

Reputation: 1644

How can we check if any 2 intervals of a unique ID overlaps?

I have data of patient prescription of oral DM drugs, i.e. DPP4 and SU, and would like to find out if patients had taken the drugs concurrently (i.e. whether there are overlapping intervals for DPP4 and SU within the same patient ID).

Sample data:

  ID DRUG      START        END
1  1 DPP4 2020-01-01 2020-01-20
2  1 DPP4 2020-03-01 2020-04-01
3  1   SU 2020-03-15 2020-04-30
4  2   SU 2020-10-01 2020-10-31
5  2 DPP4 2020-12-01 2020-12-31

In the sample data above,

  1. ID == 1, patient had DPP4 and SU concurrently from 2020-03-15 to 2020-04-01.
  2. ID == 2, patient had consumed both medications at separate intervals.

I thought of splitting the data into 2, one for DPP4 and another for SU. Then, do a full join, and compare each DPP4 interval with each SU interval. This may be okay for small data, but if a patient has like 5 rows for DPP4 and another 5 for SU, we will have 25 comparisons, which may not be efficient. Add that with 10000+ patients.

I am not sure how to do it.

New data:

Hope to have a new df that looks like this. Or anything that is tidy.

  ID    DRUG      START        END
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>

Data Code:

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DRUG = c("DPP4", "DPP4", 
"SU", "SU", "DPP4"), START = structure(c(18262, 18322, 18336, 
18536, 18597), class = "Date"), END = structure(c(18281, 18353, 
18382, 18566, 18627), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

df_new <- structure(list(ID = 1:2, DRUG = c("DPP4-SU", NA), START = structure(c(18336, 
NA), class = "Date"), END = structure(c(18353, NA), class = "Date")), class = "data.frame", row.names = c(NA, 
-2L))

Edit: I think from the sample data I gave, it may seem that there can only be 1 intersecting interval. But there may be more. So, I think this would be better data to illustrate.

structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4", 
"DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004, 
17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate", 
"Date")), END = structure(c(17039, 17405, 17405, 17521, 17625, 
17669, 17711), class = c("IDate", "Date")), duration = c(35L, 
22L, 22L, 103L, 188L, 20L, 35L), INDEX = c(1L, 0L, 0L, 0L, 0L, 
0L, 0L)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", 
"data.frame"))

Upvotes: 9

Views: 220

Answers (4)

Anoushiravan R
Anoushiravan R

Reputation: 21908

Updated Solution I have made considerable modifications based on the newly provided data set. This time I first created interval for each START and END pair and extract the intersecting period between them. As dear Martin nicely made use of them we could use lubridate::int_start and lubridate::int_end to extract the START and END date of each interval:

library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)

df %>%
  group_by(ID) %>%
  arrange(START, END) %>%
  mutate(int = interval(START, END),
         is_over = c(NA, map2(int[-n()], int[-1], 
                              ~ intersect(.x, .y)))) %>%
  unnest(cols = c(is_over)) %>% 
  select(-int) %>%
  filter(!is.na(is_over) | !is.na(lead(is_over))) %>%
  select(!c(START, END)) %>%
  mutate(grp = cumsum(is.na(is_over))) %>%
  group_by(grp) %>%
  summarise(ID = first(ID), 
            DRUG = paste0(DRUG, collapse = "-"), 
            is_over = na.omit(is_over)) %>%
  mutate(START = int_start(is_over), 
         END = int_end(is_over)) %>%
  select(!is_over)

# A tibble: 1 x 5
    grp    ID DRUG    START               END                
  <int> <int> <chr>   <dttm>              <dttm>             
1     1     1 DPP4-SU 2020-03-15 00:00:00 2020-04-01 00:00:00

Second data set:

# A tibble: 2 x 5
    grp    ID DRUG    START               END                
  <int> <dbl> <chr>   <dttm>              <dttm>             
1     1     3 DPP4-SU 2017-08-05 00:00:00 2017-08-27 00:00:00
2     2     3 SU-DPP4 2017-09-28 00:00:00 2017-12-21 00:00:00

Upvotes: 6

ThomasIsCoding
ThomasIsCoding

Reputation: 101064

Update

As per updated df

df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c(
  "DPP4",
  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"
), START = structure(c(
  17004,
  17383, 17383, 17418, 17437, 17649, 17676
), class = c(
  "IDate",
  "Date"
)), END = structure(c(
  17039, 17405, 17405, 17521, 17625,
  17669, 17711
), class = c("IDate", "Date")), duration = c(
  35L,
  22L, 22L, 103L, 188L, 20L, 35L
), INDEX = c(
  1L, 0L, 0L, 0L, 0L,
  0L, 0L
)), row.names = c(NA, -7L), class = c(
  "tbl_df", "tbl",
  "data.frame"
))

we obtain

> dfnew
    ID    DRUG      start        end
3.3  3 DPP4-SU 2017-08-05 2017-08-27
3.7  3 SU-DPP4 2017-09-28 2017-12-21

A base R option (not as fancy as the answers by @Anoushiravan R or @Martin Gal)

f <- function(d) {
  d <- d[with(d, order(START, END)), ]
  idx <- subset(
    data.frame(which((u <- with(d, outer(START, END, `<`))) & t(u), arr.ind = TRUE)),
    row > col
  )
  if (nrow(idx) == 0) {
    return(data.frame(ID = unique(d$ID), DRUG = NA, start = NA, end = NA))
  }
  with(
    d,
    do.call(rbind, 
    apply(
      idx,
      1,
      FUN = function(v) {
        data.frame(
          ID = ID[v["row"]],
          DRUG = paste0(DRUG[sort(unlist(v))], collapse = "-"),
          start = START[v["row"]],
          end = END[v["col"]]
        )
      }
    ))
  )
}

dfnew <- do.call(rbind, Map(f, split(df, ~ID)))

gives

> dfnew
  ID    DRUG      start        end
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>

Upvotes: 4

AnilGoyal
AnilGoyal

Reputation: 26218

You may use a slightly different approach from the above answers, but this will give you results in format different than required. Obviously, these can be joined to get expected results. You may try this

df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4",  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004,  17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate",  "Date")), END = structure(c(17039, 17405, 17405, 17521, 17625,  17669, 17711), class = c("IDate", "Date"))), row.names = c(NA,  -7L), class = c("tbl_df", "tbl", "data.frame"))

df
#> # A tibble: 7 x 4
#>      ID DRUG  START      END       
#>   <dbl> <chr> <date>     <date>    
#> 1     3 DPP4  2016-07-22 2016-08-26
#> 2     3 DPP4  2017-08-05 2017-08-27
#> 3     3 SU    2017-08-05 2017-08-27
#> 4     3 SU    2017-09-09 2017-12-21
#> 5     3 DPP4  2017-09-28 2018-04-04
#> 6     3 DPP4  2018-04-28 2018-05-18
#> 7     3 DPP4  2018-05-25 2018-06-29
library(tidyverse)

df %>%
  mutate(treatment_id = row_number()) %>%
  pivot_longer(c(START, END), names_to = 'event', values_to = 'dates') %>%
  mutate(event = factor(event, levels = c('END', 'START'), ordered = TRUE)) %>%
  group_by(ID) %>%
  arrange(dates, event, .by_group = TRUE) %>%
  mutate(overlap = cumsum(ifelse(event == 'START', 1, -1))) %>%
  filter((overlap > 1 & event == 'START') | (overlap > 0 & event == 'END'))
#> # A tibble: 4 x 6
#> # Groups:   ID [1]
#>      ID DRUG  treatment_id event dates      overlap
#>   <dbl> <chr>        <int> <ord> <date>       <dbl>
#> 1     3 SU               3 START 2017-08-05       2
#> 2     3 DPP4             2 END   2017-08-27       1
#> 3     3 DPP4             5 START 2017-09-28       2
#> 4     3 SU               4 END   2017-12-21       1

on originally provided data

# A tibble: 2 x 6
# Groups:   ID [1]
     ID DRUG  treatment_id event dates      overlap
  <int> <chr>        <int> <ord> <date>       <dbl>
1     1 SU               3 START 2020-03-15       2
2     1 DPP4             2 END   2020-04-01       1

For transforming/getting results in original shape, you may filter overlapping rows

library(tidyverse)
df_new <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4",  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004,  17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate",  "Date")), END = structure(c(17039, 17405, 17405, 17521, 17625,  17669, 17711), class = c("IDate", "Date"))), row.names = c(NA,  -7L), class = c("tbl_df", "tbl", "data.frame"))



df_new %>%
  mutate(treatment_id = row_number()) %>%
  pivot_longer(c(START, END), names_to = 'event', values_to = 'dates') %>%
  mutate(event = factor(event, levels = c('END', 'START'), ordered = TRUE)) %>%
  group_by(ID) %>%
  arrange(dates, event, .by_group = TRUE) %>%
  mutate(overlap = cumsum(ifelse(event == 'START', 1, -1))) %>%
  filter((overlap > 1 & event == 'START') | (overlap > 0 & event == 'END')) %>%
  left_join(df_new %>% mutate(treatment_id = row_number()), by = c('ID', 'DRUG', 'treatment_id'))
#> # A tibble: 4 x 8
#> # Groups:   ID [1]
#>      ID DRUG  treatment_id event dates      overlap START      END       
#>   <dbl> <chr>        <int> <ord> <date>       <dbl> <date>     <date>    
#> 1     3 SU               3 START 2017-08-05       2 2017-08-05 2017-08-27
#> 2     3 DPP4             2 END   2017-08-27       1 2017-08-05 2017-08-27
#> 3     3 DPP4             5 START 2017-09-28       2 2017-09-28 2018-04-04
#> 4     3 SU               4 END   2017-12-21       1 2017-09-09 2017-12-21

Created on 2021-08-10 by the reprex package (v2.0.0)

Upvotes: 3

Martin Gal
Martin Gal

Reputation: 16978

It's way more complicated than dear @AnoushiravanR's but as an alternative you could try

library(dplyr)
library(tidyr)
library(lubridate)

df %>% 
  full_join(x = ., y = ., by = "ID") %>% 
#  filter(DRUG.x != DRUG.y | START.x != START.y | END.x != END.y) %>% 
  filter(DRUG.x != DRUG.y) %>%
  group_by(ID, intersection = intersect(interval(START.x, END.x), interval(START.y, END.y))) %>% 
  drop_na(intersection) %>% 
  filter(START.x == first(START.x)) %>% 
  summarise(DRUG  = paste(DRUG.x, DRUG.y, sep = "-"),
            START = as_date(int_start(intersection)),
            END   = as_date(int_end(intersection)),
            .groups = "drop") %>% 
  select(-intersection)

returning

# A tibble: 1 x 4
     ID DRUG    START      END       
  <int> <chr>   <date>     <date>    
1     1 DPP4-SU 2020-03-15 2020-04-01

Edit: Changed the filter condition. The former one was flawed.

Upvotes: 6

Related Questions