Reputation: 11
My df has some unique and some double entries and columns showing starting and ending date for every observation, but they cannot overlap for the same id.
df <- data.frame(id = c(22,22,102,102,102),
start_date = as.Date(c("2013-10-29","2014-01-09",
"2016-09-14",
"2016-09-14","2016-09-14")),
end_date = as.Date(c("2017-08-15","2018-10-05",
"2016-10-09",
"2017-12-12","2018-10-17")))
head(df)
id start_date end_date
1 22 2013-10-29 2017-08-15
2 22 2014-01-09 2018-10-05
3 102 2016-09-14 2016-10-09
4 102 2016-09-14 2017-12-12
5 102 2016-09-14 2018-10-17
ids 22 and 102 dates interval overlap, but for 22 with different start_date and for 102 with the same start_date.
The result I need is:
Any idea or suggestions?
The result I'd expect is:
head(fixed_df)
id start_date end_date
1 22 2013-10-29 2017-08-15
2 22 2017-08-15 2018-10-05
3 102 2016-09-14 2016-10-09
4 102 2016-10-09 2017-12-12
5 102 2017-12-12 2018-10-17
Upvotes: 1
Views: 71
Reputation: 5766
With dplyr
, I would do it as such:
library(dplyr)
df %>% group_by(id) %>%
arrange(start_date) %>%
mutate(
lag(end_date),
overlap = start_date < lag(end_date, default=as.Date('2000-01-01')),
new_start_date = if_else(overlap, lag(end_date), start_date)
)
id start_date end_date `lag(end_date)` overlap new_start_date
<dbl> <date> <date> <date> <lgl> <date>
1 22 2013-10-29 2017-08-15 NA FALSE 2013-10-29
2 22 2014-01-09 2018-10-05 2017-08-15 TRUE 2017-08-15
3 102 2016-09-14 2016-10-09 NA FALSE 2016-09-14
4 102 2016-09-14 2017-12-12 2016-10-09 TRUE 2016-10-09
5 102 2016-09-14 2018-10-17 2017-12-12 TRUE 2017-12-12
This one is quite verbose, but merely to demonstrate what is going one.
Some key points:
group_by
to keep comparisons within id
.lag
- compare with previous value. But use a good default value, that is also the same type.Consider using lag(end_date) + days(1)
if you want strict no overlaps.
Upvotes: 0
Reputation: 1116
In R, you can easily compare date objects with normal ==, > or < operators, so by using a loop and few tests here is a working solution:
#Loop over every lines except the last one
for (line in c(1:(length(df$id)-1)))
{
#Do something only if next line have the same ID
if(df$id[line]==df$id[line+1])
{
#Check if end date is after start date of the next line
if(df$end_date[line]>df$start_date[line+1])
{
#If yes, put the start date of next line to end date of current line
df$start_date[line+1]=df$end_date[line]
}
}
}
Upvotes: 1