Reputation: 103
I have data that consists of observations of the range of time a particular status applied for an individual. An individual can have multiple statuses at a given time, or have no stats at all (in which case there would not be an observation).
ID STATUS START END
1 A 2013-03-07 2013-04-20
1 B 2013-04-10 2013-05-10
1 C 2013-04-16 2013-07-11
1 B 2013-07-25 2013-08-08
2 A 2013-09-10 2014-04-21
2 C 2013-12-27 2014-01-26
2 D 2014-04-28 2014-05-10
2 E 2014-05-11 2014-07-16
I would like to convert this into a dataframe consisting of distinct, non-overlapping periods and all of the statuses that applied during those periods. The result from the example would be as below.
ID STATUS START END
1 A 2013-03-07 2013-04-10
1 A,B 2013-04-10 2013-04-16
1 A,B,C 2013-04-16 2013-04-20
1 B,C 2013-04-20 2013-05-10
1 C 2013-05-10 2013-07-11
1 B 2013-07-25 2013-08-08
2 A 2013-09-10 2013-12-27
2 A,C 2013-12-27 2014-01-26
2 D 2014-04-28 2014-05-10
2 E 2014-05-11 2014-07-16
My data is fairly large, and so I struggling to find an efficient way to do this. I've found semi-related problems, such as here, but they typically involve collapsing overlapping regions into a super region, rather than breaking them out into distinct subregions.
Any help or ideas would be appreciated. Thanks.
Upvotes: 1
Views: 147
Reputation: 5201
library(dplyr)
# Create all the possible date ranges
date_breaks <- sort(c(df$START, df$END))
# Build a data.frame with all possible combinations
df2 <- expand.grid(START = head(date_breaks, -1),
ID = unique(df$ID),
STATUS = unique(df$STATUS),
stringsAsFactors = F)
df2$END <- tail(date_breaks, -1)
# Join in original data and filter to where the start and end are in range
df2 %>%
inner_join(df, by = c("ID", "STATUS")) %>%
filter(START.y <= START.x, END.y >= END.x) %>%
group_by(ID, START = START.x, END = END.x) %>%
summarise(STATUS = paste(unique(STATUS), collapse = ', ')) %>%
select(ID, STATUS, START, END)
# A tibble: 11 x 4
# Groups: ID, START [11]
# ID STATUS START END
# <int> <chr> <date> <date>
# 1 1 A 2013-03-07 2013-04-10
# 2 1 A, B 2013-04-10 2013-04-16
# 3 1 A, B, C 2013-04-16 2013-04-20
# 4 1 B, C 2013-04-20 2013-05-10
# 5 1 C 2013-05-10 2013-07-11
# 6 1 B 2013-07-25 2013-08-08
# 7 2 A 2013-09-10 2013-12-27
# 8 2 A, C 2013-12-27 2014-01-26
# 9 2 A 2014-01-26 2014-04-21
# 10 2 D 2014-04-28 2014-05-10
# 11 2 E 2014-05-11 2014-07-16
Data
df <- read.table(text = "ID STATUS START END
1 A 2013-03-07 2013-04-20
1 B 2013-04-10 2013-05-10
1 C 2013-04-16 2013-07-11
1 B 2013-07-25 2013-08-08
2 A 2013-09-10 2014-04-21
2 C 2013-12-27 2014-01-26
2 D 2014-04-28 2014-05-10
2 E 2014-05-11 2014-07-16",
header = T,
colClasses = c('integer', 'character', 'Date', 'Date'))
Upvotes: 2