Reputation: 1981
I have a large data set of time periods, defined by a 'start' and an 'end' column. Some of the periods overlap.
I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.
Some example data:
ID start end
1 A 2013-01-01 2013-01-05
2 A 2013-01-01 2013-01-05
3 A 2013-01-02 2013-01-03
4 A 2013-01-04 2013-01-06
5 A 2013-01-07 2013-01-09
6 A 2013-01-08 2013-01-11
7 A 2013-01-12 2013-01-15
Desired result:
ID start end
1 A 2013-01-01 2013-01-06
2 A 2013-01-07 2013-01-11
3 A 2013-01-12 2013-01-15
What I have tried:
require(dplyr)
data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200,
1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA,
-7L), class = "data.frame")
remove.overlaps <- function(data) {
  data2 <- data
  for (i in 1:length(unique(data$start))) {
    x3 <- filter(data2, start >= data$start[i] & start <= data$end[i])
    x4 <- x3[1, ]
    x4$end <- max(x3$end)
    data2 <- filter(data2, start < data$start[i] | start > data$end[i])
    data2 <- rbind(data2, x4)
  }
  data2 <- na.omit(data2)
}
data <- remove.overlaps(data)
Upvotes: 22
Views: 8569
Reputation: 16981
data.table solution
First, I echo @enmyj and @zach that the solution in the accepted answer gives erroneous results when there is one range that is completely inside another.
A faster approach that is reminiscent of the one proposed in the accepted answer:
1. Sort by ID and then by all dates (start and end combined).
2. Take a cumulative sum of +1 for each start date and -1 for each end date, and find the indices where the cumulative sum is 0. The dates on these rows are the end dates of each collection of overlapping date ranges. The dates on the next row are the start dates of the next collection of overlapping date ranges. The indices can also be used to easily perform roll-up calculations of other columns.
This involves just a few vectorized calls and no grouping operations, so it is very performant.
As a function:
flatten <- function(dt) {
setorder(dt[, rbindlist(.(.(ID, start, 1L), .(ID, end, -1L)))], V1, V2)[
, .(
ID = V1[i <- which(!cumsum(V3))],
start = V2[c(1L, i[-length(i)] + 1L)],
end = V2[i]
)
]
}
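As a quick sanity check (a sketch, assuming data is still the example data frame from the question), applying the function should reproduce the three merged periods from the desired result:
library(data.table)

flatten(as.data.table(data))
# should return one row per merged period:
# A 2013-01-01 2013-01-06, A 2013-01-07 2013-01-11, A 2013-01-12 2013-01-15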
The benchmarking uses a large-ish data.table:
library(data.table)
library(dplyr)
library(ivs)
data <- data.table(
ID = sample(1e3, 1e5, 1),
start = as.Date(sample(1e4:2e4, 1e5, 1), origin = "1970-01-01")
)[, end := start + sample(100)]
fCum <- function(dt) {
# adapted from https://stackoverflow.com/a/47337684/9463489
dt %>%
arrange(ID, start) %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
reframe(start = min(start), end = max(end)) %>%
select(-indx)
}
fivs <- function(dt) {
# adapted from https://stackoverflow.com/a/71754454/9463489
dt %>%
mutate(interval = iv(start, end), .keep = "unused") %>%
group_by(ID) %>%
reframe(interval = iv_groups(interval)) %>%
mutate(start = iv_start(interval), end = iv_end(interval)) %>%
select(-interval)
}
squish <- function(dt) {
# adapted from https://stackoverflow.com/a/53890653/9463489
setkey(dt, ID, start, end)
dt[,.(START_DT = start,
END_DT = end,
indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-.N])),
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
][, indx := NULL]
}
Timings:
microbenchmark::microbenchmark(
flatten = flatten(dt),
fCum = setDT(fCum(dt)),
fivs = setDT(fivs(dt)),
squish = squish(dt),
times = 10,
check = "equal",
setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> flatten 11.4732 11.8141 13.86760 12.36580 15.9228 19.1775 10
#> fCum 1827.1197 1876.7701 1898.24285 1908.88640 1926.6548 1939.2919 10
#> fivs 160.2568 163.9617 173.31783 173.32095 177.3789 192.7755 10
#> squish 62.5197 64.9126 66.26047 65.08515 67.1685 70.9916 10
The approach used by flatten also makes it easy to aggregate other columns in the data.table.
data[, v := runif(1e5)]
setorder(data[, rbindlist(.(.(ID, start, 1L, 0), .(ID, end, -1L, v)))], V1, V2)[
, .(
ID = V1[i <- which(!cumsum(V3))],
start = V2[c(1L, i[-length(i)] + 1L)],
end = V2[i],
v = diff(c(0, cumsum(V4)[i]))
)
]
#> ID start end v
#> 1: 1 1997-09-25 1997-09-27 0.40898255
#> 2: 1 1997-11-09 1997-11-30 0.44067634
#> 3: 1 1998-04-27 1998-07-17 1.73142460
#> 4: 1 1999-08-05 1999-11-05 0.41103832
#> 5: 1 1999-12-09 2000-01-26 0.90639735
#> ---
#> 60286: 1000 2023-01-06 2023-03-28 0.54727106
#> 60287: 1000 2023-07-20 2023-10-16 1.74270130
#> 60288: 1000 2024-03-24 2024-06-23 0.07110824
#> 60289: 1000 2024-07-13 2024-07-31 0.63888263
#> 60290: 1000 2024-10-02 2024-10-19 0.22872167
Upvotes: 2
Reputation: 2950
I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).
The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.
It seems like you want to do this by ID, so I've also grouped by ID.
library(ivs)
library(dplyr)
data <- tribble(
~ID, ~start, ~end,
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-02", "2013-01-03",
"A", "2013-01-04", "2013-01-06",
"A", "2013-01-07", "2013-01-09",
"A", "2013-01-08", "2013-01-11",
"A", "2013-01-12", "2013-01-15"
) %>%
mutate(
start = as.Date(start),
end = as.Date(end)
)
data
#> # A tibble: 7 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 A 2013-01-01 2013-01-05
#> 2 A 2013-01-01 2013-01-05
#> 3 A 2013-01-02 2013-01-03
#> 4 A 2013-01-04 2013-01-06
#> 5 A 2013-01-07 2013-01-09
#> 6 A 2013-01-08 2013-01-11
#> 7 A 2013-01-12 2013-01-15
# Combine `start` and `end` into a single interval vector column
data <- data %>%
mutate(interval = iv(start, end), .keep = "unused")
# Note that this is a half-open interval!
data
#> # A tibble: 7 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-05)
#> 2 A [2013-01-01, 2013-01-05)
#> 3 A [2013-01-02, 2013-01-03)
#> 4 A [2013-01-04, 2013-01-06)
#> 5 A [2013-01-07, 2013-01-09)
#> 6 A [2013-01-08, 2013-01-11)
#> 7 A [2013-01-12, 2013-01-15)
# It seems like you'd want to group by ID, so let's do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
group_by(ID) %>%
summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-06)
#> 2 A [2013-01-07, 2013-01-11)
#> 3 A [2013-01-12, 2013-01-15)
Created on 2022-04-05 by the reprex package (v2.0.1)
Upvotes: 5
Reputation: 92292
Here's a possible solution. The basic idea is to compare the lagged start date with the maximum end date "until now" using the cummax function, and to create an index that will separate the data into groups:
data %>%
arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
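To see how the grouping index works, it can help to look at the intermediate indx column on its own, before the second group_by() (an illustration only; the values were worked through by hand for the example data):
data %>%
  arrange(ID, start) %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()]))
# indx comes out as 0 0 0 0 1 1 2 for the seven example rows: the first four
# overlapping periods form group 0, the next two form group 1, and the last
# stand-alone period forms group 2.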
Upvotes: 22
Reputation: 5405
@David Arenburg's answer is great, but I ran into an issue where an earlier interval ended after a later interval, and using last in the summarise call resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end):
data %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end))
Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.
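Putting the two suggestions together (the arrange() step from the accepted answer plus min/max), a sketch of the full pipeline:
data %>%
  arrange(ID, start) %>%   # sort by the grouping variable(s) and start first
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))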
Upvotes: 18
Reputation: 401
It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRanges method because the package is not yet available for R 3.5.1.)
Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2018-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.
library(data.table)
start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")
# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]
# set keys (also orders)
setkey(d2, ID, start_col, end_col)
# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col,
END_DT = end_col,
indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
]
Upvotes: 2
Reputation: 42544
For the sake of completeness, the IRanges package on Bioconductor has some neat functions which can be used to deal with date or datetime ranges. One of them is the reduce() function, which merges overlapping or adjacent ranges.
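As a tiny illustration of what reduce() does on plain integer ranges (a sketch; the exact print output is omitted):
library(IRanges)

# Ranges 1-5 and 3-7 overlap, so reduce() merges them into 1-7;
# the range 10-12 does not touch the others and is returned unchanged.
reduce(IRanges(start = c(1, 3, 10), end = c(5, 7, 12)))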
However, there is a drawback: IRanges works on integer ranges (hence the name), so the convenience of using IRanges functions comes at the expense of converting Date or POSIXct objects to and fro.
Also, it seems that dplyr doesn't play well with IRanges (at least judging by my limited experience with dplyr), so I use data.table:
library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)
setDT(data)[, {
ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
.(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15
A code variant is
setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
, lapply(.SD, as_datetime), .SDcols = -"width"],
by = ID]
In both variants, as_datetime() from the lubridate package is used, which spares us from specifying the origin when converting numbers to POSIXct objects.
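For example, taking one of the numeric end values from the question's data (1357344000, which corresponds to 2013-01-05 UTC):
library(lubridate)

as_datetime(1357344000)
#> [1] "2013-01-05 UTC"

# the base equivalent needs the origin spelled out (at least in older R versions):
as.POSIXct(1357344000, origin = "1970-01-01", tz = "UTC")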
It would be interesting to see a benchmark comparison of the IRanges approaches vs. David's answer.
Upvotes: 6