Jonno Bourne
Jonno Bourne

Reputation: 1981

How to flatten / merge overlapping time periods

I have a large data set of time periods, defined by a 'start' and and an 'end' column. Some of the periods overlap.

I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.

Some example data:

  ID      start        end
1  A 2013-01-01 2013-01-05
2  A 2013-01-01 2013-01-05
3  A 2013-01-02 2013-01-03
4  A 2013-01-04 2013-01-06
5  A 2013-01-07 2013-01-09
6  A 2013-01-08 2013-01-11
7  A 2013-01-12 2013-01-15

Desired result:

  ID      start        end
1  A 2013-01-01 2013-01-06
2  A 2013-01-07 2013-01-11
3  A 2013-01-12 2013-01-15

What I have tried:

  require(dplyr)
  data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"), 
    start = structure(c(1356998400, 1356998400, 1357084800, 1357257600, 
    1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200, 
    1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct", 
    "POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA, 
-7L), class = "data.frame")

remove.overlaps <- function(data){
data2 <- data
for ( i in 1:length(unique(data$start))) {
x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
x4 <- x3[1,]
x4$end <- max(x3$end)
data2 <- filter(data2, start<data$start[i] | start>data$end[i])
data2 <- rbind(data2,x4)  
}
data2 <- na.omit(data2)}

data <- remove.overlaps(data)

Upvotes: 22

Views: 8569

Answers (6)

jblood94
jblood94

Reputation: 16981

Benchmarks along with an even faster data.table solution

First, I echo @enmyj and @zach that the solution in the accepted answer gives erroneous results when there is one range that is completely inside another.

A faster approach that is reminiscent of the one proposed in the accepted answer:

  1. Sort by ID and then all dates (start and end combined).
  2. Subtract the cumulative sum of number of start dates by the cumulative sum of number of end dates.
  3. Find the indices where this sum is 0. The dates on these rows are the end dates of each collection of overlapping date ranges. The dates on the next row are the start dates of the next collection of overlapping date ranges. The indices can also be used to easily perform roll-up calculations of other columns.

This involves just a few vectorized calls and no grouping operations, so it is very performant.

As a function:

flatten <- function(dt) {
  setorder(dt[, rbindlist(.(.(ID, start, 1L), .(ID, end, -1L)))], V1, V2)[
    , .(
      ID = V1[i <- which(!cumsum(V3))],
      start = V2[c(1L, i[-length(i)] + 1L)],
      end = V2[i]
    )
  ]
}

Benchmarks

The benchmarking uses a large-ish data.table.

library(data.table)
library(dplyr)
library(ivs)

data <- data.table(
  ID = sample(1e3, 1e5, 1),
  start = as.Date(sample(1e4:2e4, 1e5, 1), origin = "1970-01-01")
)[, end := start + sample(100)]

fCum <- function(dt) {
  # adapted from https://stackoverflow.com/a/47337684/9463489
  dt %>%
    arrange(ID, start) %>%
    group_by(ID) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                                cummax(as.numeric(end)))[-n()])) %>%
    group_by(ID, indx) %>%
    reframe(start = min(start), end = max(end)) %>%
    select(-indx)
}

fivs <- function(dt) {
  # adapted from https://stackoverflow.com/a/71754454/9463489
  dt %>%
    mutate(interval = iv(start, end), .keep = "unused") %>%
    group_by(ID) %>%
    reframe(interval = iv_groups(interval)) %>%
    mutate(start = iv_start(interval), end = iv_end(interval)) %>%
    select(-interval)
}

squish <- function(dt) {
  # adapted from https://stackoverflow.com/a/53890653/9463489
  setkey(dt, ID, start, end)
  dt[,.(START_DT = start, 
        END_DT = end, 
        indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-.N])),
     keyby=ID
  ][,.(start=min(START_DT), 
       end = max(END_DT)),
    by=c("ID","indx")
  ][, indx := NULL]
}

Timings:

microbenchmark::microbenchmark(
  flatten = flatten(dt),
  fCum = setDT(fCum(dt)),
  fivs = setDT(fivs(dt)),
  squish = squish(dt),
  times = 10,
  check = "equal",
  setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  flatten   11.4732   11.8141   13.86760   12.36580   15.9228   19.1775    10
#>     fCum 1827.1197 1876.7701 1898.24285 1908.88640 1926.6548 1939.2919    10
#>     fivs  160.2568  163.9617  173.31783  173.32095  177.3789  192.7755    10
#>   squish   62.5197   64.9126   66.26047   65.08515   67.1685   70.9916    10

Aggregating other columns

The approach used by flatten also makes it easy to aggregate other columns in the data.table.

data[, v := runif(1e5)]

setorder(data[, rbindlist(.(.(ID, start, 1L, 0), .(ID, end, -1L, v)))], V1, V2)[
  , .(
    ID = V1[i <- which(!cumsum(V3))],
    start = V2[c(1L, i[-length(i)] + 1L)],
    end = V2[i],
    v = diff(c(0, cumsum(V4)[i]))
  )
]
#>          ID      start        end          v
#>     1:    1 1997-09-25 1997-09-27 0.40898255
#>     2:    1 1997-11-09 1997-11-30 0.44067634
#>     3:    1 1998-04-27 1998-07-17 1.73142460
#>     4:    1 1999-08-05 1999-11-05 0.41103832
#>     5:    1 1999-12-09 2000-01-26 0.90639735
#>    ---                                      
#> 60286: 1000 2023-01-06 2023-03-28 0.54727106
#> 60287: 1000 2023-07-20 2023-10-16 1.74270130
#> 60288: 1000 2024-03-24 2024-06-23 0.07110824
#> 60289: 1000 2024-07-13 2024-07-31 0.63888263
#> 60290: 1000 2024-10-02 2024-10-19 0.22872167

Upvotes: 2

Davis Vaughan
Davis Vaughan

Reputation: 2950

I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).

The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.

It seems like you want to do this by ID, so I've also grouped by ID.

library(ivs)
library(dplyr)

data <- tribble(
  ~ID,       ~start,         ~end,
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-02", "2013-01-03",
  "A", "2013-01-04", "2013-01-06",
  "A", "2013-01-07", "2013-01-09",
  "A", "2013-01-08", "2013-01-11",
  "A", "2013-01-12", "2013-01-15"
) %>%
  mutate(
    start = as.Date(start),
    end = as.Date(end)
  )

data
#> # A tibble: 7 × 3
#>   ID    start      end       
#>   <chr> <date>     <date>    
#> 1 A     2013-01-01 2013-01-05
#> 2 A     2013-01-01 2013-01-05
#> 3 A     2013-01-02 2013-01-03
#> 4 A     2013-01-04 2013-01-06
#> 5 A     2013-01-07 2013-01-09
#> 6 A     2013-01-08 2013-01-11
#> 7 A     2013-01-12 2013-01-15

# Combine `start` and `end` into a single interval vector column
data <- data %>%
  mutate(interval = iv(start, end), .keep = "unused")

# Note that this is a half-open interval!
data  
#> # A tibble: 7 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-05)
#> 2 A     [2013-01-01, 2013-01-05)
#> 3 A     [2013-01-02, 2013-01-03)
#> 4 A     [2013-01-04, 2013-01-06)
#> 5 A     [2013-01-07, 2013-01-09)
#> 6 A     [2013-01-08, 2013-01-11)
#> 7 A     [2013-01-12, 2013-01-15)

# It seems like you'd want to group by ID, so lets do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-06)
#> 2 A     [2013-01-07, 2013-01-11)
#> 3 A     [2013-01-12, 2013-01-15)

Created on 2022-04-05 by the reprex package (v2.0.1)

Upvotes: 5

David Arenburg
David Arenburg

Reputation: 92292

Here's a possible solution. The basic idea here is to compare lagged start date with the maximum end date "until now" using the cummax function and create an index that will separate the data into groups

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15

Upvotes: 22

zack
zack

Reputation: 5405

@David Arenburg's answer is great - but I ran into an issue where an earlier interval ended after a later interval - but using last in the summarise call resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end)

data %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))

Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.

Upvotes: 18

enmyj
enmyj

Reputation: 401

It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRange method because the package is not yet available for R 3.5.1)

Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2017-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.

library(data.table)

start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")

# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]

# set keys (also orders)
setkey(d2, ID, start_col, end_col)

# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col, 
                  END_DT = end_col, 
                  indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
               keyby=ID
               ][,.(start=min(START_DT), 
                    end = max(END_DT)),
                 by=c("ID","indx")
                 ]

Upvotes: 2

Uwe
Uwe

Reputation: 42544

For the sake of completeness, the IRanges package on Bioconductor has some neat functions which can be used to deal with date or date time ranges. One of it is the reduce() function which merges overlapping or adjacent ranges.

However, there is a drawback because IRanges works on integer ranges (hence the name), so the convenience of using IRanges functions comes at the expense of converting Date or POSIXct objects to and fro.

Also, it seems that dplyr doesn't play well with IRanges (at least judged by my limited experience with dplyr) so I use data.table:

library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15

A code variant is

setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
  , lapply(.SD, as_datetime), .SDcols = -"width"], 
  by = ID]

In both variants the as_datetime() from the lubridate packages is used which spares to specify the origin when converting numbers to POSIXct objects.

Would be interesting to see a benchmark comparision of the IRanges approaches vs David's answer.

Upvotes: 6

Related Questions