Reputation: 387
I have a dataframe in this format
Account ID, Start Date, End Date
1 , 2016-01-01, 2016-02-01
1 , 2016-02-02, 2016-03-01
1 , 2016-03-01, 2016-04-01
2 , 2016-01-01, 2016-02-01
2 , 2016-03-02, 2016-03-20
2 , 2016-03-21, 2016-04-01
I want the resulting dataframe to look like.
Account ID, Start Date, End Date
1 , 2016-01-01, 2016-04-01
2 , 2016-01-01, 2016-02-01
2 , 2016-03-02, 2016-04-01
Such that if there's less than 7 days between an End Date and a subsequent Start Date for an account, it will merge those into one and use the End Date of the latter record and the start date of the former record.
I've experimented grouping with Lead and Lag with dplyr but that wouldn't work for accounts with 3 or more records.
In the example,
Account ID 1 is a case where it would be solved with grouping by accountID, and max, min would work
but Account ID 2 is a case where that would not work.
Any help is really appreciated.
Upvotes: 0
Views: 375
Reputation: 160872
Your data:
dat <- read.table(text = "AccountID StartDate EndDate
1 2016-01-01 2016-02-01
1 2016-02-02 2016-03-01
1 2016-03-01 2016-04-01
2 2016-01-01 2016-02-01
2 2016-03-02 2016-03-20
2 2016-03-21 2016-04-01", header = TRUE, stringsAsFactors = FALSE)
dat[2:3] <- lapply(dat[2:3], as.Date)
You can use lag
after grouping:
library(dplyr)
group_by(dat, AccountID) %>%
mutate(
week = cumsum(StartDate - lag(EndDate, default = 0) > 7)
) %>%
group_by(AccountID, week) %>%
summarize(
StartDate = min(StartDate),
EndDate = max(EndDate)
) %>%
ungroup()
# # A tibble: 3 × 4
# AccountID week StartDate EndDate
# <int> <int> <date> <date>
# 1 1 1 2016-01-01 2016-04-01
# 2 2 1 2016-01-01 2016-02-01
# 3 2 2 2016-03-02 2016-04-01
Upvotes: 2