Reputation: 757
I have a dataset that has multiple variables, two of which are dates (start date, end date). Sometimes a date interval has been split into sequences so for example you would have:
Start: 1990-12-12, Stop: 1990-12-13 Start: 1990-12-13, Stop: 1990-12-14
Rather than
Start: 1990-12-12, Stop: 1990-12-14
What I want to do is isolate these chains of sequences and basically collapse them into one observation such that all observations from the end of the sequence are saved with the rest being overwritten (except the first start date). Below is a basic example:
library(tidyverse)
library(lubridate)
tib_ex <- tibble(
id = rep(1,5),
date1 = ymd(c('1990-11-05', '1990-12-01',
'1990-12-05', '1990-12-08',
'1990-12-15')),
date2 = ymd(c('1990-11-28', '1990-12-05',
'1990-12-08', '1990-12-12',
'1990-12-31')),
var1 = 2:6,
var2 = 7:11,
var3 = 12:16,
var4 = c(0, 1, 0 ,0, 1)
)
This yields the following tibble:
# A tibble: 5 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-05 3 8 13 1
3 1 1990-12-05 1990-12-08 4 9 14 0
4 1 1990-12-08 1990-12-12 5 10 15 0
5 1 1990-12-15 1990-12-31 6 11 16 1
Which I want to transform into the following tibble:
# A tibble: 3 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-12 5 10 15 0
3 1 1990-12-15 1990-12-31 6 11 16 1
I thought about nesting by id, date1 and date2 which packs the rest of the variables into a tibble per row making it easy to overwrite I just don't know how to efficiently collapse the dates from row 2 to row 4.
I've tried creating a binary variable which tracks if the end date of one observation matches the start date of the following observation but I'm running into difficulties there as well.
Upvotes: 3
Views: 1019
Reputation: 42564
Here is a different approach which also will work if the dataset contains more than one individual id
. According to OP's expected result the additional variables var1
to var4
are aggregated/summarized by picking the value at the end of each collapsed period.
The approach below
cumsum()
and lag()
to identify rows which belong to one period,summarize()
to collapse the start and end dates,The last step avoids to include all additional variables in the call to summarize()
.
tib_ex %>%
arrange(id, date1, date2) %>% # this is important!
group_by(id) %>%
mutate(period = cumsum(lag(date2, default = date1[1]) < date1)) %>%
right_join(
(.) %>% group_by(id, period) %>%
summarize(date1 = first(date1), date2 = last(date2)),
by = c("id", "period", "date2"), suffix = c("", ".y")) %>%
select(-period, -date1.y)
# A tibble: 3 x 7 # Groups: id [1] id date1 date2 var1 var2 var3 var4 <dbl> <date> <date> <int> <int> <int> <dbl> 1 1 1990-11-05 1990-11-28 2 7 12 0 2 1 1990-12-08 1990-12-12 5 10 15 0 3 1 1990-12-15 1990-12-31 6 11 16 1
Here is a test that the approach is working for multiple id
:
tib_ex %>%
bind_rows(
(.) %>% mutate(id = 2))
duplicates OPs dataset for id = 2
:
# A tibble: 10 x 7 id date1 date2 var1 var2 var3 var4 <dbl> <date> <date> <int> <int> <int> <dbl> 1 1 1990-11-05 1990-11-28 2 7 12 0 2 1 1990-12-01 1990-12-05 3 8 13 1 3 1 1990-12-05 1990-12-08 4 9 14 0 4 1 1990-12-08 1990-12-12 5 10 15 0 5 1 1990-12-15 1990-12-31 6 11 16 1 6 2 1990-11-05 1990-11-28 2 7 12 0 7 2 1990-12-01 1990-12-05 3 8 13 1 8 2 1990-12-05 1990-12-08 4 9 14 0 9 2 1990-12-08 1990-12-12 5 10 15 0 10 2 1990-12-15 1990-12-31 6 11 16 1
tib_ex %>%
bind_rows(
(.) %>% mutate(id = 2)) %>%
arrange(id, date1, date2) %>% # this is important!
group_by(id) %>%
mutate(period = cumsum(lag(date2, default = date1[1]) < date1)) %>%
right_join(
(.) %>% group_by(id, period) %>%
summarize(date1 = first(date1), date2 = last(date2)),
by = c("id", "period", "date2"), suffix = c("", ".y")) %>%
select(-period, -date1.y)
# A tibble: 6 x 7 # Groups: id [2] id date1 date2 var1 var2 var3 var4 <dbl> <date> <date> <int> <int> <int> <dbl> 1 1 1990-11-05 1990-11-28 2 7 12 0 2 1 1990-12-08 1990-12-12 5 10 15 0 3 1 1990-12-15 1990-12-31 6 11 16 1 4 2 1990-11-05 1990-11-28 2 7 12 0 5 2 1990-12-08 1990-12-12 5 10 15 0 6 2 1990-12-15 1990-12-31 6 11 16 1
Upvotes: 1
Reputation: 5670
Find the rows with start and end dates by comparing with the next / previous row and combine the result in a suitable way:
date_info <-
tib_ex %>%
## find indices of start and end dates by comparing with date in next / previous row
mutate(is_startdate = date1 != lag(date2),
is_enddate = date2 != lead(date1)) %>%
## NA's appear at the beginning (start_date) and end (end_date) and should thus be interpreted as TRUE
replace_na(list(is_startdate = T, is_enddate = T))
## combine the start- and end-dates
date_info %>%
filter(is_enddate) %>%
mutate(date1 = date_info$date1[date_info$is_startdate]) %>%
select(-starts_with("is_"))
-------
# A tibble: 3 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1.00 1990-11-05 1990-11-28 2 7 12 0
2 1.00 1990-12-01 1990-12-12 5 10 15 0
3 1.00 1990-12-15 1990-12-31 6 11 16 1.00
Upvotes: 7