syhwyqp
syhwyqp

Reputation: 77

Collapsing consecutive dates into a single row

I saw many threads about how to merge consecutive dates into a single row, and tried few of them (including this and using lead from dplyr), but so far didn't find thread that specifically answers my question.

Here's what my data looks like:

df <- data.frame(
    id = c("A", "A", "A", "B", "B", "C", "C", "C"),
    start = as.Date(c("2013-05-21", "2014-03-17", "2014-12-12", "2009-03-08", 
                      "2011-07-30", "2008-10-07", "2009-11-21", "2010-12-01")),
    end = as.Date(c("2014-03-16", "2014-12-11", NA, "2011-07-14", 
                    NA, "2009-11-20", NA, NA)),
    status = c("expired", "expired", "active", "expired", 
               "active", "expired", "expired", "active")
    )

And below is my desired output:

id          start          end          status
A           2013-05-21     NA           active
B           2009-03-08     2011-07-14   expired
B           2011-07-30     NA           active
C           2008-10-07     NA           active 

So what I want to do is threefold:

1) If rows are consecutive, i.e. end date + 1 is the start date of the next row, I'd like to collapse them into a single row (as in id A)

2) If rows are not consecutive, i.e. end date + 1 is NOT the start date of the next row, I'd like to keep them separate (as in id B)

3) If "expired" rows do not have end date, I'd still like to collapse them into a single row (as in id C)

Any help would be greatly appreciated!

Upvotes: 1

Views: 391

Answers (1)

arg0naut91
arg0naut91

Reputation: 14764

You could go for something like:

library(tidyverse)

df %>%
  group_by(id) %>%
  mutate(
    end = if_else(is.na(end), lead(start), end),
    flag = if_else(start <= lag(end) + 1, 0, 1),
    flag = if_else(is.na(flag), 0, flag),
    group = cumsum(flag),
    flag = NULL
  ) %>%
  group_by(id, group) %>%
  mutate(
    start = first(start),
    end = last(end),
    status = last(status)
  ) %>% ungroup() %>% 
  distinct(id, start, end, status)

Output:

# A tibble: 4 x 4
  id    start      end        status 
  <fct> <date>     <date>     <fct>  
1 A     2013-05-21 NA         active 
2 B     2009-03-08 2011-07-14 expired
3 B     2011-07-30 NA         active 
4 C     2008-10-07 NA         active 

Upvotes: 2

Related Questions