Kevin M
Kevin M

Reputation: 59

Consolidate multiple time intervals within a group unit (dplyr or data.table solution preferred)

I have a tricky little coding issue I was hoping someone might have a solution for.

I essentially have a very large dataset of stays (person ids, admits, discharges), >10 million.

library(dplyr)
library(lubridate)

dat <- read.csv(text="
personid, start, end
1, 2017-09-01, 2017-10-01
1, 2017-10-05, 2017-10-07
2, 2017-10-21, 2017-11-01
3, 2017-12-01, 2017-12-15
3, 2017-12-27, 2017-12-31") %>%
  transmute(
    personid,
    start = ymd(start), 
    end = ymd(end))

Each stay is non-overlapping, but we have a logic rule where if the stays are within 10 days of each other we want to consolidate them as one (i.e keep the earlier admit and the later discharge). So that the final dataset are unique stays at least 10 days from each other.

e.g.:

1, 2017-09-01, 2017-10-07
2, 2017-10-21, 2017-11-01
3, 2017-12-01, 2017-12-15
3, 2017-12-27, 2017-12-31

There are a few posts about overlapping intervals, but this is a little different: Consolidating set of time intervals, chains of intervals to single interval I think overly complex compared to what I need.

I was also hoping for a dplyr or data.table solution, though the group_by statement takes quite a while.

Upvotes: 2

Views: 124

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389265

One tidyverse possibility. We first group_by person_id and create a new variable (diffe) which has difference in days between current start day and previous (lag) end day. We group every person_id and diffe which is within 10 days into one group and select first start day and last end day from each group.

library(tidyverse)

dat %>%
  group_by(personid) %>%
  mutate(diffe = as.numeric(start - lag(end))) %>%
  replace_na(list(diffe = 0)) %>%
  group_by(personid, group = cumsum(diffe > 10)) %>%
  summarise(start = first(start), 
            end = last(end)) %>%
  select(-group)


#  personid   start      end       
#     <int>   <date>     <date>    
#1        1 2017-09-01 2017-10-07
#2        2 2017-10-21 2017-11-01
#3        3 2017-12-01 2017-12-15
#4        3 2017-12-27 2017-12-31

Upvotes: 2

Related Questions