dplyr conditional grouping dates

Question

I have a dataframe in this format

Account ID, Start Date, End Date
      1   , 2016-01-01, 2016-02-01
      1   , 2016-02-02, 2016-03-01
      1   , 2016-03-01, 2016-04-01
      2   , 2016-01-01, 2016-02-01
      2   , 2016-03-02, 2016-03-20
      2   , 2016-03-21, 2016-04-01

I want the resulting dataframe to look like.

Account ID, Start Date, End Date
      1   , 2016-01-01, 2016-04-01
      2   , 2016-01-01, 2016-02-01
      2   , 2016-03-02, 2016-04-01

Such that if there's less than 7 days between an End Date and a subsequent Start Date for an account, it will merge those into one and use the End Date of the latter record and the start date of the former record.

I've experimented grouping with Lead and Lag with dplyr but that wouldn't work for accounts with 3 or more records.

In the example,

Account ID 1 is a case where it would be solved with grouping by accountID, and max, min would work

but Account ID 2 is a case where that would not work.

Any help is really appreciated.

r2evans · Accepted Answer

Your data:

dat <- read.table(text = "AccountID StartDate  EndDate
1         2016-01-01 2016-02-01
1         2016-02-02 2016-03-01
1         2016-03-01 2016-04-01
2         2016-01-01 2016-02-01
2         2016-03-02 2016-03-20
2         2016-03-21 2016-04-01", header = TRUE, stringsAsFactors = FALSE)
dat[2:3] <- lapply(dat[2:3], as.Date)

You can use lag after grouping:

library(dplyr)
group_by(dat, AccountID) %>%
  mutate(
    week = cumsum(StartDate - lag(EndDate, default = 0) > 7)
  ) %>%
  group_by(AccountID, week) %>%
  summarize(
    StartDate = min(StartDate),
    EndDate = max(EndDate)
  ) %>%
  ungroup()
# # A tibble: 3 × 4
#   AccountID  week  StartDate    EndDate
#                  
# 1         1     1 2016-01-01 2016-04-01
# 2         2     1 2016-01-01 2016-02-01
# 3         2     2 2016-03-02 2016-04-01

dplyr conditional grouping dates

Answers (1)

Related Questions