JFG123
JFG123

Reputation: 597

Separate overlaps in start and end range into own row of data frame

I have this data:

start_data <- data.frame(stringsAsFactors=FALSE,
                          Person = c(1, 1, 1, 1),
                          Event = c(1, 2, 3, 4),
                          Var1 = c(1, 2, 3, 5),
                          Var2 = c(7, 8, 9, 6),
                          Var3 = c(13, 14, 15, 7),
                          Start_Date = c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"),
                          End_Date = c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020")
)
 
start_data
  Person Event Var1 Var2 Var3 Start_Date   End_Date
1      1     1    1    7   13  1/01/2020 10/01/2020
2      1     2    2    8   14  5/01/2020 20/01/2020
3      1     3    3    9   15 21/01/2020 30/01/2020
4      1     4    5    6    7 23/01/2020 25/01/2020

And I want to convert it into this:

end_data <- data.frame(stringsAsFactors=FALSE,
                        Person = c(1, 1, 1, 1, 1, 1),
                        Event = c("1", "1 AND 2", "2", "3", "3 AND 4", "3"),
                        Var1 = c("1", "1 AND 2", "2", "3", "3 AND 5", "3"),
                        Var2 = c("7", "7 AND 8", "8", "9", "9 AND 6", "9"),
                        Var3 = c(13, 14, 14, 15, 15, 15),
                        Start_Date = c("1/01/2020", "5/01/2020", "11/01/2020", "21/01/2020",
                                       "23/01/2020", "26/01/2020"),
                        End_Date = c("4/01/2020", "10/01/2020", "20/01/2020", "22/01/2020",
                                     "25/01/2020", "30/01/2020")
)
 
end_data
  Person   Event    Var1    Var2 Var3 Start_Date   End_Date
1      1       1       1       7   13  1/01/2020  4/01/2020
2      1 1 AND 2 1 AND 2 7 AND 8   14  5/01/2020 10/01/2020
3      1       2       2       8   14 11/01/2020 20/01/2020
4      1       3       3       9   15 21/01/2020 22/01/2020
5      1 3 AND 4 3 AND 5 9 AND 6   15 23/01/2020 25/01/2020
6      1       3       3       9   15 26/01/2020 30/01/2020

The code should expand the data frame to isolate overlapping start_date and end_date ranges. When there is an overlapping range it should create a new row with the overlapping data. Therefore, when looking at the final table, there should be no Start_Date and End_Date ranges that overlap with one another. Further, the result of the table should be summarised for Event, Var1 and Var3 by concatenating the results. Var 3 should also be aggregated by taking the max value in the overlapping ranges.

Ideally, I want to apply this code to a number of "Persons" so playing nice with group_by() or nesting with dplyr would be preferential.

Edit:

In the case of 3 overlapping periods as per a question below. It would look like this; enter image description here

Edit2:

The solution by @ekoam is very close. However, it does not deal with the below example. Event 5 encapsulates the whole period. Hence, there should be no missing ranges. However, a 'start' and 'end' for "2017-05-17" and "2017-06-11" respectively is missing.

> trial_start_data <- data.frame(stringsAsFactors=FALSE,
+                          Person = c(1, 1, 1, 1),
+                          Event = c(5,6,7,8),
+                          Start_Date = as.Date(c("24/04/2017","09/05/2017","12/06/2017","21/06/2017"), "%d/%m/%Y"),
+                          End_Date = as.Date(c("28/09/2017","16/05/2017","21/06/2017","25/06/2017"), "%d/%m/%Y")
+ )
> 
> trial_start_data
  Person Event Start_Date   End_Date
1      1     5 2017-04-24 2017-09-28
2      1     6 2017-05-09 2017-05-16
3      1     7 2017-06-12 2017-06-21
4      1     8 2017-06-21 2017-06-25

> disjoint_subsets(trial_start_data$Start_Date, trial_start_data$End_Date)
       start        end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-06-12 2017-06-20
4 2017-06-21 2017-06-21
5 2017-06-22 2017-06-25
6 2017-06-26 2017-09-28

Upvotes: 2

Views: 190

Answers (1)

ekoam
ekoam

Reputation: 8844

The main problem here is to find all disjoint subsets of a set of ranges in an efficient way. Consider this function

disjoint_subsets <- function(starts, ends) {
  t1 <- min(starts)
  starts <- as.integer(starts - t1)
  ends <- as.integer(ends - t1) + 2L
  nvec <- ends - starts + 1L
  x <- sequence(nvec, starts) * 10L
  ends <- cumsum(nvec); starts <- ends - nvec + 1L
  x[ends] <- x[ends] - 9L; x[starts] <- x[starts] + 9L
  x <- sort(unique(x))
  b <- which(x %% 10L > 0L)
  lb <- x[head(b[!(b + 1L) %in% b], -1L)]
  ub <- x[tail(b[!(b - 1L) %in% b], -1L)]
  lb <- (lb + 9L * (lb %% 10L < 2L) + 1L) %/% 10L
  ub <- (ub - 9L * (ub %% 10L > 8L) - 1L) %/% 10L
  data.frame(start = lb + t1 - 1L, end = ub + t1 - 1L)
}
Usage
> with(trial_start_data, disjoint_subsets(Start_Date, End_Date))
       start        end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-05-17 2017-06-11
4 2017-06-12 2017-06-20
5 2017-06-21 2017-06-21
6 2017-06-22 2017-06-25
7 2017-06-26 2017-09-28

However, the function is a bit slow. There is still room for improvement. It finds all disjoint subsets in about a second for a dataframe with 100k rows. For a dataframe with 1 million rows, it takes 10-15 seconds to run. See the benchmark

starts_e6 <- sample(Sys.Date() + -1000:1000, size = 1e6, T)
ends_e6 <- starts_e6 + sample.int(1000, 1e6, T)
starts_e5 <- sample(Sys.Date() + -1000:1000, size = 1e5, T)
ends_e5 <- starts_e5 + sample.int(1000, 1e5, T)
microbenchmark::microbenchmark(
  disjoint_subsets(starts_e6, ends_e6),
  disjoint_subsets(starts_e5, ends_e5), 
  times = 2L
)

Unit: milliseconds
                                 expr      min       lq      mean    median        uq       max neval cld
 disjoint_subsets(starts_e6, ends_e6) 11299.59 11299.59 11366.623 11366.623 11433.652 11433.652     2   b
 disjoint_subsets(starts_e5, ends_e5)   873.66   873.66  1028.057  1028.057  1182.455  1182.455     2  a 

base::unique is the bottleneck here. However, if we can somehow use fewer elements to represent an interval, then we can save a lot of time.

The rest is just a piece of cake. You can use data.table::foverlap to perform a non-equal join of start_data and all the disjoint subsets. Then, summarise the joined data.table to get the end_data you want. For example,

library(data.table)

setDT(trial_start_data)[, c("Start_Date", "End_Date") := lapply(.SD, as.Date, "%d/%m/%Y"), .SDcols = c("Start_Date", "End_Date")]
dsubs = trial_start_data[, disjoint_subsets(Start_Date, End_Date)]; setDT(dsubs)

setkey(dsubs, start, end)
setkey(trial_start_data, Start_Date, End_Date)

foverlaps(dsubs, trial_start_data, type = "within")

Output

    Person Event Start_Date   End_Date      start        end
 1:      1     5 2017-04-24 2017-09-28 2017-04-24 2017-05-08
 2:      1     5 2017-04-24 2017-09-28 2017-05-09 2017-05-16
 3:      1     6 2017-05-09 2017-05-16 2017-05-09 2017-05-16
 4:      1     5 2017-04-24 2017-09-28 2017-05-17 2017-06-11
 5:      1     5 2017-04-24 2017-09-28 2017-06-12 2017-06-20
 6:      1     7 2017-06-12 2017-06-21 2017-06-12 2017-06-20
 7:      1     5 2017-04-24 2017-09-28 2017-06-21 2017-06-21
 8:      1     7 2017-06-12 2017-06-21 2017-06-21 2017-06-21
 9:      1     8 2017-06-21 2017-06-25 2017-06-21 2017-06-21
10:      1     5 2017-04-24 2017-09-28 2017-06-22 2017-06-25
11:      1     8 2017-06-21 2017-06-25 2017-06-22 2017-06-25
12:      1     5 2017-04-24 2017-09-28 2017-06-26 2017-09-28

A dataframe with just 100k rows should not be a problem for any data.table function. This is also by far the most efficient way I can think of. I will leave out the remaining steps as the answer is pretty long now. Also, I think they were covered by my answer to one of your previous posts.

Upvotes: 1

Related Questions