Reputation: 231
I have a dataset with what are essentially episodes of time related to an individual which can overlap (i.e. an episode could start later but finish earlier than the previous). Because of this overlap issue I'm struggling to get the latest end_date in the sequence once they're in order by start_date.
The code I've been using works to a point but I have to repeat as shown in the code below. For that reason I guess I need some loop function to go through a process until a condition is met (that the end_date is later than the end_date on the previous row, or the id indicates a new individual).
library(dplyr)
## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
"A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
"2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
"2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
"2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
"2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
"2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
"2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
"2012-09-28","2013-10-11"))
df <- data.frame(id, start_date, end_date, target_date)
Using the method to flatten overlapping time periods gets me close but I think it needs a lag adding in somewhere to replicate the target_date...
df <- df %>%
arrange(id, start_date) %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
cummax(as.numeric(end_date)))[-n()])) %>%
group_by(id, indx) %>%
mutate(latest_date = max(end_date)) %>%
ungroup()
Upvotes: 1
Views: 330
Reputation: 42544
If I understand correctly, the OP wants to identify overlapping episodes which are fully embraced by the longer episode. In addition, the end date of the embracing period should appear on the next row (within an id
)
This can be accomplished by a variation of David Arenburg's approach:
df %>%
arrange(id, start_date) %>% # df must be ordered appropriately
group_by(id) %>% # create new grouping variable
mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>%
group_by(id, grp) %>%
mutate(target_date_new = max(end_date)) %>%
group_by(id) %>% # re-group ...
mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6 # Groups: id [2] id start_date end_date target_date grp target_date_new <fct> <date> <date> <date> <int> <date> 1 A 2004-01-23 2009-06-30 NA 1 NA 2 A 2005-03-31 2005-09-17 2009-06-30 1 2009-06-30 3 A 2005-03-31 2005-09-19 2009-06-30 1 2009-06-30 4 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 5 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 6 A 2006-04-03 2006-06-19 2009-06-30 1 2009-06-30 7 A 2007-11-26 2009-06-30 2009-06-30 1 2009-06-30 8 A 2010-10-12 2010-11-05 2009-06-30 2 2009-06-30 9 A 2011-08-08 2011-11-18 2010-11-05 3 2010-11-05 10 A 2012-06-26 2012-06-26 2011-11-18 4 2011-11-18 11 A 2012-06-26 2012-06-26 2012-06-26 4 2012-06-26 12 A 2012-09-11 2012-09-11 2012-06-26 5 2012-06-26 13 A 2012-10-03 2014-04-01 2012-09-11 6 2012-09-11 14 B 2003-12-01 2012-08-29 NA 1 NA 15 B 2006-02-28 2006-02-28 2012-08-29 1 2012-08-29 16 B 2012-04-16 2012-04-16 2012-08-29 1 2012-08-29 17 B 2012-08-30 2012-09-28 2012-08-29 2 2012-08-29 18 B 2012-09-19 2013-10-11 2012-09-28 3 2012-09-28 19 B 2012-09-28 2013-07-19 2013-10-11 3 2013-10-11
Here, end_date
s are compared because the OP wants to detect fully embraced periods. So, whenever an end_date
appears which is larger than any of the previous end_date
s the episode counter grp
is advanced because the current episode is not fully included in the previous periods.
As cummax()
has no method for objects of type Date
, the dates are coerced to integer value.
Upvotes: 1
Reputation: 6020
I would give this problem a different approach than using lag
. The issue is that there is a hierarchical structure in your data that can have multiple levels.
In the following code I try to look for the other episodes of which the current row is a part of (i.e. lies completely within another episode).
Then I take the min(start_date)
and max(end_date)
to define the outer most episode.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(id, start_date, end_date, target_date) %>%
mutate(episode = row_number())
df %>%
select(id, episode,start_date, end_date) %>%
inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
group_by(id,episode,start_date, end_date) %>%
nest() %>%
mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
end_date_outer >= ..3))) %>%
mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))
this results in
# A tibble: 19 x 8
id episode start_date end_date data match start_date_parent end_date_parent
<fct> <int> <date> <date> <list> <list> <date> <date>
1 A 1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23 2009-06-30
2 A 2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
3 A 3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
4 A 4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
5 A 5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
6 A 6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
7 A 7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
8 A 8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12 2010-11-05
9 A 9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08 2011-11-18
10 A 10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
11 A 11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
12 A 12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11 2012-09-11
13 A 13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03 2014-04-01
14 B 14 2003-12-01 2012-08-29 <tibble [6 x 3]> <tibble [1 x 3]> 2003-12-01 2012-08-29
15 B 15 2006-02-28 2006-02-28 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
16 B 16 2012-04-16 2012-04-16 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
17 B 17 2012-08-30 2012-09-28 <tibble [6 x 3]> <tibble [1 x 3]> 2012-08-30 2012-09-28
18 B 18 2012-09-19 2013-10-11 <tibble [6 x 3]> <tibble [1 x 3]> 2012-09-19 2013-10-11
19 B 19 2012-09-28 2013-07-19 <tibble [6 x 3]> <tibble [2 x 3]> 2012-09-19 2013-10-11
We can see here that the first 7 episodes of id A are part of episode 1 and the rest stand on their own.
Another option would be to use sqldf
for example if the dataset becomes large.
require(sqldf)
result <- sqldf("select
df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
from df AS df1
inner join df AS df2
on df1.id = df2.id
and df1.start_date >= df2.start_date
and df1.end_date <= df2.end_date
group by df1.id, df1.episode
")
result %>%
select(id, start_date, end_date) %>%
distinct()
results in:
id start_date end_date
1 A 2004-01-23 2009-06-30
2 A 2010-10-12 2010-11-05
3 A 2011-08-08 2011-11-18
4 A 2012-06-26 2012-06-26
5 A 2012-09-11 2012-09-11
6 A 2012-10-03 2014-04-01
7 B 2003-12-01 2012-08-29
8 B 2012-08-30 2012-09-28
9 B 2012-09-19 2013-10-11
Upvotes: 1