Reputation: 67
I have a large dataset where I want to identify observations that overlap in time and space. Each observation has a unique ID and I have already identified those that overlap in space, given by overlap_space
. Now, I want to check whether the start/end-points of the observations that overlap in space also overlap.
A simple example is given below:
start <- c("2007-06-27", "2010-06-30", "2015-01-01", "2012-01-01", "2010-01-01", "2009-01-01")
end <- c("2008-10-01", "2010-07-01", "2017-02-02", "2013-01-01", "2010-07-03", "2012-01-01")
df <- data.frame(id = c(1:6),
start = as.Date(start, format = "%Y-%m-%d"),
end = as.Date(end, format = "%Y-%m-%d"),
overlap_id = as.character(c("2, 4", "1, 3, 5", "2, 5", "1, 5, 6", "2, 3, 4", "4")))```
> df
id start end overlap_id
1 1 2007-06-27 2008-10-01 2, 4
2 2 2010-06-30 2010-07-01 1, 3, 5
3 3 2015-01-01 2017-02-02 2, 5
4 4 2012-01-01 2013-01-01 1, 5, 6
5 5 2010-01-01 2010-07-03 2, 3, 4
6 6 2009-01-01 2012-01-01 4
Not all spatially overlapping IDs overlap in time. How do I identify the ones that do? In other words, I need to match on overlap_id
(which can be made into a longer format with tidyr::separate_rows(overlap_id)
as well as on start/end date. I have tried to make date intervals using lubridate::interval
, but I have not been able to assure that the overlaps are constrained to to those identified in overlap_id
.
This is the output I would like:
> df
id start end overlap_id time_overlap overlap_dummy
1 1 2007-06-27 2008-10-01 2, 4 NA 0
2 2 2010-06-30 2010-07-01 1, 3, 5 5 1
3 3 2015-01-01 2017-02-02 2, 5 NA 0
4 4 2012-01-01 2013-01-01 1, 5, 6 6 1
5 5 2010-01-01 2010-07-03 2, 3, 4 2 1
6 6 2009-01-01 2012-01-01 4 6 1
Any help would be greatly appreciated! Thanks.
Upvotes: 0
Views: 161
Reputation: 18425
Here is one way...
library(lubridate)
library(tidyverse)
df2 <- df %>% separate_rows(overlap_id, convert = TRUE) %>% #spread rows
left_join(df %>% select(-overlap_id) %>% #add dates for overlap rows
rename(start1 = start, #avoid name clash on join
end1 = end),
by = c("overlap_id" = "id")) %>%
filter(int_overlaps(interval(start, end), #delete non-overlapping
interval(start1, end1))) %>%
group_by(id) %>%
summarise(time_overlap = paste(overlap_id, sep = ", ")) %>% #paste overlaps if more than 1
right_join(df) #merge back to df (by id)
df2
# A tibble: 6 x 5
id time_overlap start end overlap_id
<int> <chr> <date> <date> <fct>
1 1 NA 2007-06-27 2008-10-01 2, 4
2 2 5 2010-06-30 2010-07-01 1, 3, 5
3 3 NA 2015-01-01 2017-02-02 2, 5
4 4 6 2012-01-01 2013-01-01 1, 5, 6
5 5 2 2010-01-01 2010-07-03 2, 3, 4
6 6 4 2009-01-01 2012-01-01 4
Upvotes: 2