Reputation: 770
I have multiple rows with the same ID, and date ranges for each row. Sometimes, these date ranges overlap. I need to identify rows where they overlap.
EG Dataset:
eg_data <- data.frame(
id = c(1,1,1, 2,2, 3,3,3,3,3,3, 4,4, 5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016",
"08/12/2016", "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016",
"09/04/2016", "10/10/2016", "01/01/2016", "05/28/2016", "01/01/2016",
"06/05/2016", "08/25/2016", "11/01/2016"),
end_dt = c("12/01/2016", "03/14/2017", "05/15/2017", "05/15/2016",
"12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016",
"09/25/2016", "11/29/2016", "05/31/2016", "08/19/2016", "06/10/2016",
"07/25/2016", "08/29/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)
Run the eg data above, and you see that the
start date for row 3 overlaps the end date of row 2 for ID #1; start date for row 13 overlaps the end date of row 12 for ID #4; and start date for row 15 overlaps the end date of row 14 for ID #5.
I need to be able to identify when this type of overlaps happens, for a single ID number.
Upvotes: 1
Views: 1662
Reputation: 2950
I think that this can be done pretty nicely with the ivs package, a package for working with generic interval vectors, which is exactly what you have here. Interval vectors combine two columns of start/end points into a single interval column, which you can then manipulate with one of the many iv_*()
functions. In particular, I used iv_identify_group()
here to identify the within-id overlaps.
library(dplyr)
library(ivs)
eg_data <- tibble(
id = c(1,1,1,2,2,3,3,3,3,3,3,4,4,5,5,5,5),
start_dt = c(
"01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016",
"08/12/2016", "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016",
"09/04/2016", "10/10/2016", "01/01/2016", "05/28/2016", "01/01/2016",
"06/05/2016", "08/25/2016", "11/01/2016"
),
end_dt = c(
"12/01/2016", "03/14/2017", "05/15/2017", "05/15/2016",
"12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016",
"09/25/2016", "11/29/2016", "05/31/2016", "08/19/2016", "06/10/2016",
"07/25/2016", "08/29/2016", "12/30/2016"
)
)
eg_data$row_n <- 1:nrow(eg_data)
# Parse the dates into real Date objects and then combine the endpoints into an
# interval vector
eg_data <- eg_data %>%
mutate(
start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
end_dt = as.Date(end_dt, format = "%m/%d/%Y")
) %>%
mutate(
range = iv(start_dt, end_dt),
.keep = "unused"
)
# Note that they are half-open!
head(eg_data, n = 4)
#> # A tibble: 4 × 3
#> id row_n range
#> <dbl> <int> <iv<date>>
#> 1 1 1 [2016-01-01, 2016-12-01)
#> 2 1 2 [2016-12-02, 2017-03-14)
#> 3 1 3 [2017-03-12, 2017-05-15)
#> 4 2 4 [2016-02-01, 2016-05-15)
# For each id, identify the wider "overlap group" that each date range falls in
groups <- eg_data %>%
group_by(id) %>%
mutate(group = iv_identify_group(range)) %>%
group_by(group, .add = TRUE)
# Note that rows 2 and 3 are in the same overlap group within id 1
groups
#> # A tibble: 17 × 4
#> # Groups: id, group [14]
#> id row_n range group
#> <dbl> <int> <iv<date>> <iv<date>>
#> 1 1 1 [2016-01-01, 2016-12-01) [2016-01-01, 2016-12-01)
#> 2 1 2 [2016-12-02, 2017-03-14) [2016-12-02, 2017-05-15)
#> 3 1 3 [2017-03-12, 2017-05-15) [2016-12-02, 2017-05-15)
#> 4 2 4 [2016-02-01, 2016-05-15) [2016-02-01, 2016-05-15)
#> 5 2 5 [2016-08-12, 2016-12-29) [2016-08-12, 2016-12-29)
#> 6 3 6 [2016-01-01, 2016-03-02) [2016-01-01, 2016-03-02)
#> 7 3 7 [2016-03-05, 2016-04-29) [2016-03-05, 2016-04-29)
#> 8 3 8 [2016-05-07, 2016-06-29) [2016-05-07, 2016-06-29)
#> 9 3 9 [2016-07-01, 2016-08-31) [2016-07-01, 2016-08-31)
#> 10 3 10 [2016-09-04, 2016-09-25) [2016-09-04, 2016-09-25)
#> 11 3 11 [2016-10-10, 2016-11-29) [2016-10-10, 2016-11-29)
#> 12 4 12 [2016-01-01, 2016-05-31) [2016-01-01, 2016-08-19)
#> 13 4 13 [2016-05-28, 2016-08-19) [2016-01-01, 2016-08-19)
#> 14 5 14 [2016-01-01, 2016-06-10) [2016-01-01, 2016-07-25)
#> 15 5 15 [2016-06-05, 2016-07-25) [2016-01-01, 2016-07-25)
#> 16 5 16 [2016-08-25, 2016-08-29) [2016-08-25, 2016-08-29)
#> 17 5 17 [2016-11-01, 2016-12-30) [2016-11-01, 2016-12-30)
# From there you can group by `id` and `group` and count them up
groups %>%
count()
#> # A tibble: 14 × 3
#> # Groups: id, group [14]
#> id group n
#> <dbl> <iv<date>> <int>
#> 1 1 [2016-01-01, 2016-12-01) 1
#> 2 1 [2016-12-02, 2017-05-15) 2
#> 3 2 [2016-02-01, 2016-05-15) 1
#> 4 2 [2016-08-12, 2016-12-29) 1
#> 5 3 [2016-01-01, 2016-03-02) 1
#> 6 3 [2016-03-05, 2016-04-29) 1
#> 7 3 [2016-05-07, 2016-06-29) 1
#> 8 3 [2016-07-01, 2016-08-31) 1
#> 9 3 [2016-09-04, 2016-09-25) 1
#> 10 3 [2016-10-10, 2016-11-29) 1
#> 11 4 [2016-01-01, 2016-08-19) 2
#> 12 5 [2016-01-01, 2016-07-25) 2
#> 13 5 [2016-08-25, 2016-08-29) 1
#> 14 5 [2016-11-01, 2016-12-30) 1
Created on 2022-04-05 by the reprex package (v2.0.1)
Upvotes: 0
Reputation: 269694
First convert the dates to Date
class. Then a self join on id
and the intersection criteria will join all relevant overlapping rows. overlap
is 1 if that row has an overlap and 0 otherwise. overlaps
lists the row numbers of the overlaps for that row. We used row numbers rowid
but we could replace each occurrence of it in the code below with row_n
if desired.
library(sqldf)
fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data,
start_dt = as.Date(start_dt, fmt),
end_dt = as.Date(end_dt, fmt))
sqldf("select
a.*,
count(b.rowid) > 0 as overlap,
coalesce(group_concat(b.rowid), '') as overlaps
from eg2 a
left join eg2 b on a.id = b.id and
not a.rowid = b.rowid and
((a.start_dt between b.start_dt and b.end_dt) or
(b.start_dt between a.start_dt and a.end_dt))
group by a.rowid
order by a.rowid")
giving:
id start_dt end_dt row_n overlap overlaps
1 1 2016-01-01 2016-12-01 1 0
2 1 2016-12-02 2017-03-14 2 1 3
3 1 2017-03-12 2017-05-15 3 1 2
4 2 2016-02-01 2016-05-15 4 0
5 2 2016-08-12 2016-12-29 5 0
6 3 2016-01-01 2016-03-02 6 0
7 3 2016-03-05 2016-04-29 7 0
8 3 2016-05-07 2016-06-29 8 0
9 3 2016-07-01 2016-08-31 9 0
10 3 2016-09-04 2016-09-25 10 0
11 3 2016-10-10 2016-11-29 11 0
12 4 2016-01-01 2016-05-31 12 1 13
13 4 2016-05-28 2016-08-19 13 1 12
14 5 2016-01-01 2016-06-10 14 1 15
15 5 2016-06-05 2016-07-25 15 1 14
16 5 2016-08-25 2016-08-29 16 0
17 5 2016-11-01 2016-12-30 17 0
Upvotes: 3