Reputation: 597
I have this data:
start_data <- data.frame(stringsAsFactors=FALSE,
Person = c(1, 1, 1, 1),
Event = c(1, 2, 3, 4),
Var1 = c(1, 2, 3, 5),
Var2 = c(7, 8, 9, 6),
Var3 = c(13, 14, 15, 7),
Start_Date = c("1/01/2020", "5/01/2020", "21/01/2020", "23/01/2020"),
End_Date = c("10/01/2020", "20/01/2020", "30/01/2020", "25/01/2020")
)
start_data
Person Event Var1 Var2 Var3 Start_Date End_Date
1 1 1 1 7 13 1/01/2020 10/01/2020
2 1 2 2 8 14 5/01/2020 20/01/2020
3 1 3 3 9 15 21/01/2020 30/01/2020
4 1 4 5 6 7 23/01/2020 25/01/2020
And I want to convert it into this:
end_data <- data.frame(stringsAsFactors=FALSE,
Person = c(1, 1, 1, 1, 1, 1),
Event = c("1", "1 AND 2", "2", "3", "3 AND 4", "3"),
Var1 = c("1", "1 AND 2", "2", "3", "3 AND 5", "3"),
Var2 = c("7", "7 AND 8", "8", "9", "9 AND 6", "9"),
Var3 = c(13, 14, 14, 15, 15, 15),
Start_Date = c("1/01/2020", "5/01/2020", "11/01/2020", "21/01/2020",
"23/01/2020", "26/01/2020"),
End_Date = c("4/01/2020", "10/01/2020", "20/01/2020", "22/01/2020",
"25/01/2020", "30/01/2020")
)
end_data
Person Event Var1 Var2 Var3 Start_Date End_Date
1 1 1 1 7 13 1/01/2020 4/01/2020
2 1 1 AND 2 1 AND 2 7 AND 8 14 5/01/2020 10/01/2020
3 1 2 2 8 14 11/01/2020 20/01/2020
4 1 3 3 9 15 21/01/2020 22/01/2020
5 1 3 AND 4 3 AND 5 9 AND 6 15 23/01/2020 25/01/2020
6 1 3 3 9 15 26/01/2020 30/01/2020
The code should expand the data frame to isolate overlapping start_date and end_date ranges. When there is an overlapping range it should create a new row with the overlapping data. Therefore, when looking at the final table, there should be no Start_Date and End_Date ranges that overlap with one another. Further, the result of the table should be summarised for Event, Var1 and Var3 by concatenating the results. Var 3 should also be aggregated by taking the max value in the overlapping ranges.
Ideally, I want to apply this code to a number of "Persons" so playing nice with group_by() or nesting with dplyr would be preferential.
Edit:
In the case of 3 overlapping periods as per a question below. It would look like this;
Edit2:
The solution by @ekoam is very close. However, it does not deal with the below example. Event 5 encapsulates the whole period. Hence, there should be no missing ranges. However, a 'start' and 'end' for "2017-05-17" and "2017-06-11" respectively is missing.
> trial_start_data <- data.frame(stringsAsFactors=FALSE,
+ Person = c(1, 1, 1, 1),
+ Event = c(5,6,7,8),
+ Start_Date = as.Date(c("24/04/2017","09/05/2017","12/06/2017","21/06/2017"), "%d/%m/%Y"),
+ End_Date = as.Date(c("28/09/2017","16/05/2017","21/06/2017","25/06/2017"), "%d/%m/%Y")
+ )
>
> trial_start_data
Person Event Start_Date End_Date
1 1 5 2017-04-24 2017-09-28
2 1 6 2017-05-09 2017-05-16
3 1 7 2017-06-12 2017-06-21
4 1 8 2017-06-21 2017-06-25
> disjoint_subsets(trial_start_data$Start_Date, trial_start_data$End_Date)
start end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-06-12 2017-06-20
4 2017-06-21 2017-06-21
5 2017-06-22 2017-06-25
6 2017-06-26 2017-09-28
Upvotes: 2
Views: 190
Reputation: 8844
The main problem here is to find all disjoint subsets of a set of ranges in an efficient way. Consider this function
disjoint_subsets <- function(starts, ends) {
t1 <- min(starts)
starts <- as.integer(starts - t1)
ends <- as.integer(ends - t1) + 2L
nvec <- ends - starts + 1L
x <- sequence(nvec, starts) * 10L
ends <- cumsum(nvec); starts <- ends - nvec + 1L
x[ends] <- x[ends] - 9L; x[starts] <- x[starts] + 9L
x <- sort(unique(x))
b <- which(x %% 10L > 0L)
lb <- x[head(b[!(b + 1L) %in% b], -1L)]
ub <- x[tail(b[!(b - 1L) %in% b], -1L)]
lb <- (lb + 9L * (lb %% 10L < 2L) + 1L) %/% 10L
ub <- (ub - 9L * (ub %% 10L > 8L) - 1L) %/% 10L
data.frame(start = lb + t1 - 1L, end = ub + t1 - 1L)
}
Usage
> with(trial_start_data, disjoint_subsets(Start_Date, End_Date))
start end
1 2017-04-24 2017-05-08
2 2017-05-09 2017-05-16
3 2017-05-17 2017-06-11
4 2017-06-12 2017-06-20
5 2017-06-21 2017-06-21
6 2017-06-22 2017-06-25
7 2017-06-26 2017-09-28
However, the function is a bit slow. There is still room for improvement. It finds all disjoint subsets in about a second for a dataframe with 100k rows. For a dataframe with 1 million rows, it takes 10-15 seconds to run. See the benchmark
starts_e6 <- sample(Sys.Date() + -1000:1000, size = 1e6, T)
ends_e6 <- starts_e6 + sample.int(1000, 1e6, T)
starts_e5 <- sample(Sys.Date() + -1000:1000, size = 1e5, T)
ends_e5 <- starts_e5 + sample.int(1000, 1e5, T)
microbenchmark::microbenchmark(
disjoint_subsets(starts_e6, ends_e6),
disjoint_subsets(starts_e5, ends_e5),
times = 2L
)
Unit: milliseconds
expr min lq mean median uq max neval cld
disjoint_subsets(starts_e6, ends_e6) 11299.59 11299.59 11366.623 11366.623 11433.652 11433.652 2 b
disjoint_subsets(starts_e5, ends_e5) 873.66 873.66 1028.057 1028.057 1182.455 1182.455 2 a
base::unique
is the bottleneck here. However, if we can somehow use fewer elements to represent an interval, then we can save a lot of time.
The rest is just a piece of cake. You can use data.table::foverlap
to perform a non-equal join of start_data
and all the disjoint subsets. Then, summarise the joined data.table to get the end_data
you want. For example,
library(data.table)
setDT(trial_start_data)[, c("Start_Date", "End_Date") := lapply(.SD, as.Date, "%d/%m/%Y"), .SDcols = c("Start_Date", "End_Date")]
dsubs = trial_start_data[, disjoint_subsets(Start_Date, End_Date)]; setDT(dsubs)
setkey(dsubs, start, end)
setkey(trial_start_data, Start_Date, End_Date)
foverlaps(dsubs, trial_start_data, type = "within")
Output
Person Event Start_Date End_Date start end
1: 1 5 2017-04-24 2017-09-28 2017-04-24 2017-05-08
2: 1 5 2017-04-24 2017-09-28 2017-05-09 2017-05-16
3: 1 6 2017-05-09 2017-05-16 2017-05-09 2017-05-16
4: 1 5 2017-04-24 2017-09-28 2017-05-17 2017-06-11
5: 1 5 2017-04-24 2017-09-28 2017-06-12 2017-06-20
6: 1 7 2017-06-12 2017-06-21 2017-06-12 2017-06-20
7: 1 5 2017-04-24 2017-09-28 2017-06-21 2017-06-21
8: 1 7 2017-06-12 2017-06-21 2017-06-21 2017-06-21
9: 1 8 2017-06-21 2017-06-25 2017-06-21 2017-06-21
10: 1 5 2017-04-24 2017-09-28 2017-06-22 2017-06-25
11: 1 8 2017-06-21 2017-06-25 2017-06-22 2017-06-25
12: 1 5 2017-04-24 2017-09-28 2017-06-26 2017-09-28
A dataframe with just 100k rows should not be a problem for any data.table function. This is also by far the most efficient way I can think of. I will leave out the remaining steps as the answer is pretty long now. Also, I think they were covered by my answer to one of your previous posts.
Upvotes: 1