hvgupta
hvgupta

Reputation: 199

Collapse/Combine Date Ranges by Grouped Rows in R

I have a dataframe that looks like this:

ID    Group    Start Date    End Date
1     A        2018-08-08    2018-08-09
1     A        2018-08-07    2018-08-08
1     A        2018-08-05    2018-08-07
1     B        2018-08-08    2018-08-09
1     B        2018-08-07    2018-08-08
2     A        2018-08-08    2018-08-09
2     A        2018-08-07    2018-08-08
2     A        2018-08-01    2018-08-07
2     B        2018-08-08    2018-08-09
2     B        2018-08-07    2018-08-08
3     B        2018-08-07    2018-08-08
4     B        2018-08-07    2018-08-08
4     B        2018-08-01    2018-08-07

And I'd like collapse it so that if the Start Date variable in one row matches the End Date variable in the next row, the rows are combined while grouping by the ID and Group variables:

ID    Group    Start Date    End Date
1     A        2018-08-05    2018-08-09
1     B        2018-08-07    2018-08-09
2     A        2018-08-01    2018-08-09
2     B        2018-08-07    2018-08-09
3     B        2018-08-07    2018-08-08
4     B        2018-08-01    2018-08-08

In dplyr, I know it's possible to do this through something similar to the following:

df %>% 
group_by(ID, Group) %>%
rowwise() %>%
do(somefunction(x){})

But I need help in writing the function. Or if there's a for loop or other implementation that could accomplish this, it would be much appreciated.

Edit I've modified the example data for more clarity.

Upvotes: 0

Views: 1511

Answers (1)

hvgupta
hvgupta

Reputation: 199

I've found a more flexible solution using the IRanges package from BioConducter. This works well for cases which are not necessarily consecutive to their neighbors:

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
library(data.table)
library(lubridate)

setDT(df)

df[, , as.data.table(reduce(
       IRanges(as.numeric(`Start Date`),
               as.numeric(`End Date`)),
               min.gapwidth=1L))[, lapply(.SD, as_date), .SDcols = -"width"], 
                                 .(`ID`, `Group`)]

In the snippet above, I'm using the reduce function from IRanges on each of the date ranges, which collapses adjacent ranges within a given tolerance, min.gapwidth. In this case, I've specified date ranges to be combined if they're within one day of each other (min.gapwidth=1L). You can set this to 0L to combine strictly consecutive dates.

IRanges only takes numeric values, so I convert the dates to numeric and then I apply the as_date function from the lubridate package to convert them back to dates. I've also removed the width column which would tell me how many days are within each range.

Upvotes: 1

Related Questions