TimL
TimL

Reputation: 231

R lagging through dates with conditions

I have a dataset with what are essentially episodes of time related to an individual which can overlap (i.e. an episode could start later but finish earlier than the previous). Because of this overlap issue I'm struggling to get the latest end_date in the sequence once they're in order by start_date.

The code I've been using works to a point but I have to repeat as shown in the code below. For that reason I guess I need some loop function to go through a process until a condition is met (that the end_date is later than the end_date on the previous row, or the id indicates a new individual).

library(dplyr)

## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
        "A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
                        "2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
                        "2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
                        "2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
                      "2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                      "2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
                      "2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
                         "2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                         "2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
                         "2012-09-28","2013-10-11"))

df <- data.frame(id, start_date, end_date, target_date)

Using the method to flatten overlapping time periods gets me close but I think it needs a lag adding in somewhere to replicate the target_date...

df <- df %>%
    arrange(id, start_date) %>%
    group_by(id) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
                                    cummax(as.numeric(end_date)))[-n()])) %>%
    group_by(id, indx) %>%       
    mutate(latest_date = max(end_date)) %>%
    ungroup()

Upvotes: 1

Views: 330

Answers (2)

Uwe
Uwe

Reputation: 42544

If I understand correctly, the OP wants to identify overlapping episodes which are fully embraced by the longer episode. In addition, the end date of the embracing period should appear on the next row (within an id)

This can be accomplished by a variation of David Arenburg's approach:

df %>% 
  arrange(id, start_date) %>% # df must be ordered appropriately
  group_by(id) %>% # create new grouping variable
  mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>% 
  group_by(id, grp) %>% 
  mutate(target_date_new = max(end_date)) %>% 
  group_by(id) %>% # re-group ...
  mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6
# Groups:   id [2]
   id    start_date end_date   target_date   grp target_date_new
   <fct> <date>     <date>     <date>      <int> <date>         
 1 A     2004-01-23 2009-06-30 NA              1 NA             
 2 A     2005-03-31 2005-09-17 2009-06-30      1 2009-06-30     
 3 A     2005-03-31 2005-09-19 2009-06-30      1 2009-06-30     
 4 A     2005-12-20 2005-12-30 2009-06-30      1 2009-06-30     
 5 A     2005-12-20 2005-12-30 2009-06-30      1 2009-06-30     
 6 A     2006-04-03 2006-06-19 2009-06-30      1 2009-06-30     
 7 A     2007-11-26 2009-06-30 2009-06-30      1 2009-06-30     
 8 A     2010-10-12 2010-11-05 2009-06-30      2 2009-06-30     
 9 A     2011-08-08 2011-11-18 2010-11-05      3 2010-11-05     
10 A     2012-06-26 2012-06-26 2011-11-18      4 2011-11-18     
11 A     2012-06-26 2012-06-26 2012-06-26      4 2012-06-26     
12 A     2012-09-11 2012-09-11 2012-06-26      5 2012-06-26     
13 A     2012-10-03 2014-04-01 2012-09-11      6 2012-09-11     
14 B     2003-12-01 2012-08-29 NA              1 NA             
15 B     2006-02-28 2006-02-28 2012-08-29      1 2012-08-29     
16 B     2012-04-16 2012-04-16 2012-08-29      1 2012-08-29     
17 B     2012-08-30 2012-09-28 2012-08-29      2 2012-08-29     
18 B     2012-09-19 2013-10-11 2012-09-28      3 2012-09-28     
19 B     2012-09-28 2013-07-19 2013-10-11      3 2013-10-11

Here, end_dates are compared because the OP wants to detect fully embraced periods. So, whenever an end_date appears which is larger than any of the previous end_dates the episode counter grp is advanced because the current episode is not fully included in the previous periods.

As cummax() has no method for objects of type Date, the dates are coerced to integer value.

Upvotes: 1

Wietze314
Wietze314

Reputation: 6020

I would give this problem a different approach than using lag. The issue is that there is a hierarchical structure in your data that can have multiple levels.

In the following code I try to look for the other episodes of which the current row is a part of (i.e. lies completely within another episode). Then I take the min(start_date) and max(end_date) to define the outer most episode.


library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(id, start_date, end_date, target_date) %>%
  mutate(episode = row_number())

df %>%
  select(id, episode,start_date, end_date) %>%
  inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
  group_by(id,episode,start_date, end_date) %>%
  nest() %>%
  mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
                                                                end_date_outer >= ..3))) %>%
  mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
         end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))


this results in


# A tibble: 19 x 8
   id    episode start_date end_date   data              match            start_date_parent end_date_parent
   <fct>   <int> <date>     <date>     <list>            <list>           <date>            <date>         
 1 A           1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23        2009-06-30     
 2 A           2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 3 A           3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 4 A           4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 5 A           5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 6 A           6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 7 A           7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 8 A           8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12        2010-11-05     
 9 A           9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08        2011-11-18     
10 A          10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
11 A          11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
12 A          12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11        2012-09-11     
13 A          13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03        2014-04-01     
14 B          14 2003-12-01 2012-08-29 <tibble [6 x 3]>  <tibble [1 x 3]> 2003-12-01        2012-08-29     
15 B          15 2006-02-28 2006-02-28 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
16 B          16 2012-04-16 2012-04-16 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
17 B          17 2012-08-30 2012-09-28 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-08-30        2012-09-28     
18 B          18 2012-09-19 2013-10-11 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-09-19        2013-10-11     
19 B          19 2012-09-28 2013-07-19 <tibble [6 x 3]>  <tibble [2 x 3]> 2012-09-19        2013-10-11  

We can see here that the first 7 episodes of id A are part of episode 1 and the rest stand on their own.


Another option would be to use sqldf for example if the dataset becomes large.


require(sqldf)

result <- sqldf("select
      df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
      from df AS df1

      inner join df AS df2 
      on df1.id = df2.id
      and df1.start_date >= df2.start_date
      and df1.end_date <= df2.end_date

      group by df1.id, df1.episode
      ")

result %>%
  select(id, start_date, end_date) %>%
  distinct()

results in:


  id start_date   end_date
1  A 2004-01-23 2009-06-30
2  A 2010-10-12 2010-11-05
3  A 2011-08-08 2011-11-18
4  A 2012-06-26 2012-06-26
5  A 2012-09-11 2012-09-11
6  A 2012-10-03 2014-04-01
7  B 2003-12-01 2012-08-29
8  B 2012-08-30 2012-09-28
9  B 2012-09-19 2013-10-11

Upvotes: 1

Related Questions