Reputation: 199
I have a dataframe that looks like this:
ID Group Start Date End Date
1 A 2018-08-08 2018-08-09
1 A 2018-08-07 2018-08-08
1 A 2018-08-05 2018-08-07
1 B 2018-08-08 2018-08-09
1 B 2018-08-07 2018-08-08
2 A 2018-08-08 2018-08-09
2 A 2018-08-07 2018-08-08
2 A 2018-08-01 2018-08-07
2 B 2018-08-08 2018-08-09
2 B 2018-08-07 2018-08-08
3 B 2018-08-07 2018-08-08
4 B 2018-08-07 2018-08-08
4 B 2018-08-01 2018-08-07
And I'd like collapse it so that if the Start Date
variable in one row matches the End Date
variable in the next row, the rows are combined while grouping by the ID
and Group
variables:
ID Group Start Date End Date
1 A 2018-08-05 2018-08-09
1 B 2018-08-07 2018-08-09
2 A 2018-08-01 2018-08-09
2 B 2018-08-07 2018-08-09
3 B 2018-08-07 2018-08-08
4 B 2018-08-01 2018-08-08
In dplyr, I know it's possible to do this through something similar to the following:
df %>%
group_by(ID, Group) %>%
rowwise() %>%
do(somefunction(x){})
But I need help in writing the function. Or if there's a for loop or other implementation that could accomplish this, it would be much appreciated.
Edit I've modified the example data for more clarity.
Upvotes: 0
Views: 1511
Reputation: 199
I've found a more flexible solution using the IRanges
package from BioConducter. This works well for cases which are not necessarily consecutive to their neighbors:
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
library(data.table)
library(lubridate)
setDT(df)
df[, , as.data.table(reduce(
IRanges(as.numeric(`Start Date`),
as.numeric(`End Date`)),
min.gapwidth=1L))[, lapply(.SD, as_date), .SDcols = -"width"],
.(`ID`, `Group`)]
In the snippet above, I'm using the reduce
function from IRanges
on each of the date ranges, which collapses adjacent ranges within a given tolerance, min.gapwidth
. In this case, I've specified date ranges to be combined if they're within one day of each other (min.gapwidth=1L
). You can set this to 0L
to combine strictly consecutive dates.
IRanges
only takes numeric values, so I convert the dates to numeric and then I apply the as_date
function from the lubridate
package to convert them back to dates. I've also removed the width
column which would tell me how many days are within each range.
Upvotes: 1