Identifying overlapping date intervals on grouped data

Question

I have a large dataset where I want to identify observations that overlap in time and space. Each observation has a unique ID and I have already identified those that overlap in space, given by overlap_space. Now, I want to check whether the start/end-points of the observations that overlap in space also overlap.

A simple example is given below:

start <- c("2007-06-27", "2010-06-30", "2015-01-01", "2012-01-01", "2010-01-01", "2009-01-01")
end <- c("2008-10-01", "2010-07-01", "2017-02-02", "2013-01-01", "2010-07-03", "2012-01-01")
df <- data.frame(id = c(1:6),
                 start = as.Date(start, format = "%Y-%m-%d"),
                 end = as.Date(end, format = "%Y-%m-%d"),
                 overlap_id = as.character(c("2, 4", "1, 3, 5", "2, 5", "1, 5, 6", "2, 3, 4", "4")))```

> df
  id      start        end overlap_id
1  1 2007-06-27 2008-10-01       2, 4
2  2 2010-06-30 2010-07-01    1, 3, 5
3  3 2015-01-01 2017-02-02       2, 5
4  4 2012-01-01 2013-01-01    1, 5, 6
5  5 2010-01-01 2010-07-03    2, 3, 4
6  6 2009-01-01 2012-01-01          4

Not all spatially overlapping IDs overlap in time. How do I identify the ones that do? In other words, I need to match on overlap_id (which can be made into a longer format with tidyr::separate_rows(overlap_id) as well as on start/end date. I have tried to make date intervals using lubridate::interval, but I have not been able to assure that the overlaps are constrained to to those identified in overlap_id.

This is the output I would like:

> df
  id      start        end overlap_id  time_overlap overlap_dummy
1  1 2007-06-27 2008-10-01       2, 4           NA             0 
2  2 2010-06-30 2010-07-01    1, 3, 5            5             1
3  3 2015-01-01 2017-02-02       2, 5           NA             0
4  4 2012-01-01 2013-01-01    1, 5, 6            6             1
5  5 2010-01-01 2010-07-03    2, 3, 4            2             1
6  6 2009-01-01 2012-01-01          4            6             1

Any help would be greatly appreciated! Thanks.

Andrew Gustar · Accepted Answer

Here is one way...

library(lubridate)
library(tidyverse)

df2 <- df %>% separate_rows(overlap_id, convert = TRUE) %>%   #spread rows
  left_join(df %>% select(-overlap_id) %>%                    #add dates for overlap rows
                   rename(start1 = start,                     #avoid name clash on join
                          end1 = end),
            by = c("overlap_id" = "id")) %>% 
  filter(int_overlaps(interval(start, end),                   #delete non-overlapping
                      interval(start1, end1))) %>% 
  group_by(id) %>% 
  summarise(time_overlap = paste(overlap_id, sep = ", ")) %>% #paste overlaps if more than 1
  right_join(df)                                              #merge back to df (by id)

df2
# A tibble: 6 x 5
     id time_overlap start      end        overlap_id
                          
1     1 NA           2007-06-27 2008-10-01 2, 4      
2     2 5            2010-06-30 2010-07-01 1, 3, 5   
3     3 NA           2015-01-01 2017-02-02 2, 5      
4     4 6            2012-01-01 2013-01-01 1, 5, 6   
5     5 2            2010-01-01 2010-07-03 2, 3, 4   
6     6 4            2009-01-01 2012-01-01 4

Identifying overlapping date intervals on grouped data

Answers (1)

Related Questions