How to get the ending date from the first observation and use it as the starting date for the second observation for the same ID?

My df has some unique and some double entries and columns showing starting and ending date for every observation, but they cannot overlap for the same id.


df <- data.frame(id = c(22,22,102,102,102),
                 start_date = as.Date(c("2013-10-29","2014-01-09",
                                 "2016-09-14",
                                 "2016-09-14","2016-09-14")), 
                 end_date = as.Date(c("2017-08-15","2018-10-05",
                                 "2016-10-09",
                                 "2017-12-12","2018-10-17")))

head(df)
   id start_date   end_date
1  22 2013-10-29 2017-08-15
2  22 2014-01-09 2018-10-05
3 102 2016-09-14 2016-10-09
4 102 2016-09-14 2017-12-12
5 102 2016-09-14 2018-10-17

ids 22 and 102 dates interval overlap, but for 22 with different start_date and for 102 with the same start_date.

The result I need is:

  1. When the dates overlap, to have the final date of the previous observation as the starting date.
  2. When the dates don't overlap, keep the actual values.

Any idea or suggestions?

The result I'd expect is:

head(fixed_df)
   id start_date   end_date
1  22 2013-10-29 2017-08-15
2  22 2017-08-15 2018-10-05
3 102 2016-09-14 2016-10-09
4 102 2016-10-09 2017-12-12
5 102 2017-12-12 2018-10-17

Upvotes: 1

Views: 71

Answers (2)

MrGumble
MrGumble

Reputation: 5766

With dplyr, I would do it as such:

library(dplyr)
df %>% group_by(id) %>%
  arrange(start_date) %>%
  mutate(
    lag(end_date),
    overlap = start_date < lag(end_date, default=as.Date('2000-01-01')),
    new_start_date = if_else(overlap, lag(end_date), start_date)
  )

     id start_date end_date   `lag(end_date)` overlap new_start_date
  <dbl> <date>     <date>     <date>          <lgl>   <date>        
1    22 2013-10-29 2017-08-15 NA              FALSE   2013-10-29    
2    22 2014-01-09 2018-10-05 2017-08-15      TRUE    2017-08-15    
3   102 2016-09-14 2016-10-09 NA              FALSE   2016-09-14    
4   102 2016-09-14 2017-12-12 2016-10-09      TRUE    2016-10-09    
5   102 2016-09-14 2018-10-17 2017-12-12      TRUE    2017-12-12   

This one is quite verbose, but merely to demonstrate what is going one.

Some key points:

  1. Use group_by to keep comparisons within id.
  2. Next, sort things.
  3. lag - compare with previous value. But use a good default value, that is also the same type.

Consider using lag(end_date) + days(1) if you want strict no overlaps.

Upvotes: 0

Chelmy88
Chelmy88

Reputation: 1116

In R, you can easily compare date objects with normal ==, > or < operators, so by using a loop and few tests here is a working solution:

#Loop over every lines except the last one
for (line in c(1:(length(df$id)-1)))
{
  #Do something only if next line have the same ID
  if(df$id[line]==df$id[line+1])
  {
    #Check if end date is after start date of the next line
    if(df$end_date[line]>df$start_date[line+1])
    {
     #If yes, put the start date of next line to end date of current line
     df$start_date[line+1]=df$end_date[line]
    }
  }


}

Upvotes: 1

Related Questions