Reputation: 59
I have a tricky little coding issue I was hoping someone might have a solution for.
I essentially have a very large dataset of stays (person ids, admits, discharges), >10 million.
library(dplyr)
library(lubridate)
dat <- read.csv(text="
personid, start, end
1, 2017-09-01, 2017-10-01
1, 2017-10-05, 2017-10-07
2, 2017-10-21, 2017-11-01
3, 2017-12-01, 2017-12-15
3, 2017-12-27, 2017-12-31") %>%
transmute(
personid,
start = ymd(start),
end = ymd(end))
Each stay is non-overlapping, but we have a logic rule where if the stays are within 10 days of each other we want to consolidate them as one (i.e keep the earlier admit and the later discharge). So that the final dataset are unique stays at least 10 days from each other.
e.g.:
1, 2017-09-01, 2017-10-07
2, 2017-10-21, 2017-11-01
3, 2017-12-01, 2017-12-15
3, 2017-12-27, 2017-12-31
There are a few posts about overlapping intervals, but this is a little different: Consolidating set of time intervals, chains of intervals to single interval I think overly complex compared to what I need.
I was also hoping for a dplyr or data.table solution, though the group_by statement takes quite a while.
Upvotes: 2
Views: 124
Reputation: 389265
One tidyverse
possibility. We first group_by
person_id
and create a new variable (diffe
) which has difference in days between current start
day and previous (lag
) end
day. We group every person_id
and diffe
which is within 10 days into one group and select first
start
day and last
end
day from each group.
library(tidyverse)
dat %>%
group_by(personid) %>%
mutate(diffe = as.numeric(start - lag(end))) %>%
replace_na(list(diffe = 0)) %>%
group_by(personid, group = cumsum(diffe > 10)) %>%
summarise(start = first(start),
end = last(end)) %>%
select(-group)
# personid start end
# <int> <date> <date>
#1 1 2017-09-01 2017-10-07
#2 2 2017-10-21 2017-11-01
#3 3 2017-12-01 2017-12-15
#4 3 2017-12-27 2017-12-31
Upvotes: 2