Reputation: 234
I have a data frame that looks like this:
w<-read.table(header=TRUE,text="
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
I'm trying to get an output that will combine overlapping start and end dates into one date range. So for the example set, I'd like to get:
w<-read.table(header=TRUE,text="
start.date end.date
2006-06-26 2006-08-16
2007-06-09 2007-07-31
2007-08-04 2007-09-04
2007-09-05 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-08-20")
The question is similar to Date roll-up in R, but I don't need to do any sort of group by on mine, so the answer there is confusing.
Also, the code that was suggested in response to my question below does not work for certain parts of my data frame such as:
x<-read.table(header=TRUE,text="start.date end.date
2006-01-19 2006-01-20
2006-01-25 2006-01-29
2006-02-24 2006-02-25
2006-03-15 2006-03-22
2006-04-29 2006-04-30
2006-05-24 2006-05-25
2006-06-26 2006-08-16
2006-07-05 2006-07-10
2006-07-12 2006-07-21
2006-08-13 2006-08-15
2006-08-18 2006-08-19
2006-08-28 2006-09-02")
I am confused why it does not?
Upvotes: 3
Views: 4944
Reputation: 7540
For anyone referring back to this older question, here's a newer option using a package dedicated to working with intervals:
library(tidyverse)
library(ivs)
w <- read.table(header = TRUE, text = "
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
w |>
mutate(iv = iv(start.date, end.date)) |>
summarise(iv = iv_groups(iv), .groups = "drop")
#> iv
#> 1 [2006-06-26, 2006-08-16)
#> 2 [2007-06-09, 2007-07-31)
#> 3 [2007-08-04, 2007-09-04)
#> 4 [2007-09-05, 2007-10-12)
#> 5 [2007-10-19, 2007-11-16)
#> 6 [2007-11-17, 2007-12-15)
#> 7 [2008-06-18, 2008-08-20)
Created on 2022-05-27 by the reprex package (v2.0.1)
Upvotes: 0
Reputation: 42544
The IRanges
package on Bioconductor includes the function reduce
which can be utilized to combine overlapping start and end dates into one date range.
IRanges
works on integer ranges so you have to convert the data from class Date
to integer
and back. This can be wrapped up in a function:
collapse_date_ranges <- function(w, min.gapwidth = 1L) {
library(data.table)
library(magrittr)
IRanges::IRanges(start = as.integer(as.Date(w$start.date)),
end = as.integer(as.Date(w$end.date))) %>%
IRanges::reduce(min.gapwidth = min.gapwidth) %>%
as.data.table() %>%
.[, lapply(.SD, lubridate::as_date),
.SDcols = c("start", "end")]
}
collapse_date_ranges(w, 0L)
# start end
#1: 2006-06-26 2006-08-16
#2: 2007-06-09 2007-07-31
#3: 2007-08-04 2007-09-04
#4: 2007-09-05 2007-10-12
#5: 2007-10-19 2007-11-16
#6: 2007-11-17 2007-12-15
#7: 2008-06-18 2008-08-20
collapse_date_ranges(x, 0L)
# start end
#1: 2006-01-19 2006-01-20
#2: 2006-01-25 2006-01-29
#3: 2006-02-24 2006-02-25
#4: 2006-03-15 2006-03-22
#5: 2006-04-29 2006-04-30
#6: 2006-05-24 2006-05-25
#7: 2006-06-26 2006-08-16
#8: 2006-08-18 2006-08-19
#9: 2006-08-28 2006-09-02
::
to access single functions from the IRanges
package over using library(IRanges)
which loads the whole package.as.Date
is just to ensure the proper class) and create an IRanges
object.reduce
does all the hard work. The parameter min.gapwidth
is required here as reduce
collapses adjacent ranges by default (see below).dplyr
instead of data.table
as well.) w
and x
. x
includes a special case where one date range embeds other date ranges to full extent.The sample result given by the OP shows that adjacent data ranges should not be collapsed, e.g., the range 2007-10-19
to 2007-11-16
is separate from the range 2007-11-17
to 2007-12-15
although the second range starts only one day after the first one has ended.
Just in case, adjacent date ranges are to be collapsed this can be achieved by using the default value of the min.gapwidth
parameter:
collapse_date_ranges(w)
# start end
#1: 2006-06-26 2006-08-16
#2: 2007-06-09 2007-07-31
#3: 2007-08-04 2007-10-12
#4: 2007-10-19 2007-12-15
#5: 2008-06-18 2008-08-20
Upvotes: 4
Reputation: 23101
Try this:
w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
w$group <- 1:nrow(w) # common intervals should belong to same group
merge.indices <- lapply(2:nrow(w), function(x) {
indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
if (length(indices) > 0) indices <- c(indices, x)
indices})
# assign the intervals the right groups
for (i in 1:length(merge.indices)) {
if (length(merge.indices[[i]]) > 0) {
w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
}
}
do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))
It conceptually merges overlapping intervals into the same group as shown below:
with output:
start.date end.date
1 2006-01-19 2006-01-20
2 2006-01-25 2006-01-29
3 2006-02-24 2006-02-25
4 2006-03-15 2006-03-22
5 2006-04-29 2006-04-30
6 2006-05-24 2006-05-25
7 2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02
Upvotes: 1
Reputation: 95
Solution.
w<-read.table(header=TRUE, stringsAsFactor=F, text="
start.date end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")
w <- data.frame(lapply(w, as.Date))
library(lubridate)
idx.rle <- rle(as.numeric(sapply(1:(nrow(w)-1), function(i) int_overlaps(interval(w[i,1],w[i,2]), interval(w[i+1,1],w[i+1,2])))))
i.starts <- nrow(w)-rev(cumsum(rev(idx.rle$length)))
i.ends <- 1+cumsum(idx.rle$length)
do.call(rbind,
lapply(1:length(idx.rle$lengths),
function(i) {
i.start <- i.starts[i]
i.end <- i.ends[i]
if(idx.rle$values[i]==1) {
d <- data.frame(start.date=w[i.start,1],
end.date=max(w[i.start:i.end,2]) );
names(d) <- names(w);
d
} else {
if(idx.rle$lengths[i]>1&i>1&i<length(idx.rle$lengths)) {
data.frame(w[(i.start+1):(i.end-1),] )
} else {
if (idx.rle$lengths[i]>=1&i==1) {
data.frame(w[(i.start):(i.end-1),])
} else {
if(idx.rle$lengths[i]>=1&i==length(idx.rle$lengths)) data.frame(w[(i.start+1):(i.end),] )
}
}
}
}))
Upvotes: 0