Joseph Noirre
Joseph Noirre

Reputation: 387

dplyr conditional grouping dates

I have a dataframe in this format

Account ID, Start Date, End Date
      1   , 2016-01-01, 2016-02-01
      1   , 2016-02-02, 2016-03-01
      1   , 2016-03-01, 2016-04-01
      2   , 2016-01-01, 2016-02-01
      2   , 2016-03-02, 2016-03-20
      2   , 2016-03-21, 2016-04-01

I want the resulting dataframe to look like.

Account ID, Start Date, End Date
      1   , 2016-01-01, 2016-04-01
      2   , 2016-01-01, 2016-02-01
      2   , 2016-03-02, 2016-04-01

Such that if there's less than 7 days between an End Date and a subsequent Start Date for an account, it will merge those into one and use the End Date of the latter record and the start date of the former record.

I've experimented grouping with Lead and Lag with dplyr but that wouldn't work for accounts with 3 or more records.

In the example,

Account ID 1 is a case where it would be solved with grouping by accountID, and max, min would work

but Account ID 2 is a case where that would not work.

Any help is really appreciated.

Upvotes: 0

Views: 375

Answers (1)

r2evans
r2evans

Reputation: 160872

Your data:

dat <- read.table(text = "AccountID StartDate  EndDate
1         2016-01-01 2016-02-01
1         2016-02-02 2016-03-01
1         2016-03-01 2016-04-01
2         2016-01-01 2016-02-01
2         2016-03-02 2016-03-20
2         2016-03-21 2016-04-01", header = TRUE, stringsAsFactors = FALSE)
dat[2:3] <- lapply(dat[2:3], as.Date)

You can use lag after grouping:

library(dplyr)
group_by(dat, AccountID) %>%
  mutate(
    week = cumsum(StartDate - lag(EndDate, default = 0) > 7)
  ) %>%
  group_by(AccountID, week) %>%
  summarize(
    StartDate = min(StartDate),
    EndDate = max(EndDate)
  ) %>%
  ungroup()
# # A tibble: 3 × 4
#   AccountID  week  StartDate    EndDate
#       <int> <int>     <date>     <date>
# 1         1     1 2016-01-01 2016-04-01
# 2         2     1 2016-01-01 2016-02-01
# 3         2     2 2016-03-02 2016-04-01

Upvotes: 2

Related Questions