docjay
docjay

Reputation: 757

R - (Tidyverse) Compress multiple observations into one

I have a dataset that has multiple variables, two of which are dates (start date, end date). Sometimes a date interval has been split into sequences so for example you would have:

Start: 1990-12-12, Stop: 1990-12-13 Start: 1990-12-13, Stop: 1990-12-14

Rather than

Start: 1990-12-12, Stop: 1990-12-14

What I want to do is isolate these chains of sequences and basically collapse them into one observation such that all observations from the end of the sequence are saved with the rest being overwritten (except the first start date). Below is a basic example:

library(tidyverse)
library(lubridate)

tib_ex <- tibble(
  id = rep(1,5),
  date1 = ymd(c('1990-11-05', '1990-12-01', 
                '1990-12-05', '1990-12-08', 
                '1990-12-15')),
  date2 = ymd(c('1990-11-28', '1990-12-05', 
                '1990-12-08', '1990-12-12', 
                '1990-12-31')),
  var1 = 2:6,
  var2 = 7:11,
  var3 = 12:16,
  var4 = c(0, 1, 0 ,0, 1)
)

This yields the following tibble:

# A tibble: 5 x 7
     id date1      date2       var1  var2  var3  var4
  <dbl> <date>     <date>     <int> <int> <int> <dbl>
1     1 1990-11-05 1990-11-28     2     7    12     0
2     1 1990-12-01 1990-12-05     3     8    13     1
3     1 1990-12-05 1990-12-08     4     9    14     0
4     1 1990-12-08 1990-12-12     5    10    15     0
5     1 1990-12-15 1990-12-31     6    11    16     1

Which I want to transform into the following tibble:

# A tibble: 3 x 7
     id date1      date2       var1  var2  var3  var4
  <dbl> <chr>      <chr>      <dbl> <dbl> <dbl> <dbl>
1     1 1990-11-05 1990-11-28     2     7    12     0
2     1 1990-12-01 1990-12-12     5    10    15     0
3     1 1990-12-15 1990-12-31     6    11    16     1

I thought about nesting by id, date1 and date2 which packs the rest of the variables into a tibble per row making it easy to overwrite I just don't know how to efficiently collapse the dates from row 2 to row 4.

I've tried creating a binary variable which tracks if the end date of one observation matches the start date of the following observation but I'm running into difficulties there as well.

Upvotes: 3

Views: 1019

Answers (2)

Uwe
Uwe

Reputation: 42564

Here is a different approach which also will work if the dataset contains more than one individual id. According to OP's expected result the additional variables var1 to var4 are aggregated/summarized by picking the value at the end of each collapsed period.

The approach below

  • uses cumsum() and lag() to identify rows which belong to one period,
  • uses summarize() to collapse the start and end dates,
  • and joins with the original dataset to pick the values at the end of each collapsed period.

The last step avoids to include all additional variables in the call to summarize().

tib_ex %>% 
  arrange(id, date1, date2) %>%   # this is important!
  group_by(id) %>% 
  mutate(period = cumsum(lag(date2, default = date1[1]) < date1)) %>% 
  right_join(
    (.) %>% group_by(id, period) %>% 
      summarize(date1 = first(date1), date2 = last(date2)),
    by = c("id", "period", "date2"), suffix = c("", ".y")) %>% 
  select(-period, -date1.y) 
# A tibble: 3 x 7
# Groups:   id [1]
     id date1      date2       var1  var2  var3  var4
  <dbl> <date>     <date>     <int> <int> <int> <dbl>
1     1 1990-11-05 1990-11-28     2     7    12     0
2     1 1990-12-08 1990-12-12     5    10    15     0
3     1 1990-12-15 1990-12-31     6    11    16     1

Here is a test that the approach is working for multiple id:

tib_ex %>% 
  bind_rows(
  (.) %>% mutate(id = 2))

duplicates OPs dataset for id = 2:

# A tibble: 10 x 7
      id date1      date2       var1  var2  var3  var4
   <dbl> <date>     <date>     <int> <int> <int> <dbl>
 1     1 1990-11-05 1990-11-28     2     7    12     0
 2     1 1990-12-01 1990-12-05     3     8    13     1
 3     1 1990-12-05 1990-12-08     4     9    14     0
 4     1 1990-12-08 1990-12-12     5    10    15     0
 5     1 1990-12-15 1990-12-31     6    11    16     1
 6     2 1990-11-05 1990-11-28     2     7    12     0
 7     2 1990-12-01 1990-12-05     3     8    13     1
 8     2 1990-12-05 1990-12-08     4     9    14     0
 9     2 1990-12-08 1990-12-12     5    10    15     0
10     2 1990-12-15 1990-12-31     6    11    16     1
tib_ex %>% 
  bind_rows(
    (.) %>% mutate(id = 2)) %>%
  arrange(id, date1, date2) %>%   # this is important!
  group_by(id) %>% 
  mutate(period = cumsum(lag(date2, default = date1[1]) < date1)) %>% 
  right_join(
    (.) %>% group_by(id, period) %>% 
      summarize(date1 = first(date1), date2 = last(date2)),
    by = c("id", "period", "date2"), suffix = c("", ".y")) %>% 
  select(-period, -date1.y) 
# A tibble: 6 x 7
# Groups:   id [2]
     id date1      date2       var1  var2  var3  var4
  <dbl> <date>     <date>     <int> <int> <int> <dbl>
1     1 1990-11-05 1990-11-28     2     7    12     0
2     1 1990-12-08 1990-12-12     5    10    15     0
3     1 1990-12-15 1990-12-31     6    11    16     1
4     2 1990-11-05 1990-11-28     2     7    12     0
5     2 1990-12-08 1990-12-12     5    10    15     0
6     2 1990-12-15 1990-12-31     6    11    16     1

Upvotes: 1

AEF
AEF

Reputation: 5670

Find the rows with start and end dates by comparing with the next / previous row and combine the result in a suitable way:

date_info <- 
  tib_ex %>% 
  ## find indices of start and end dates by comparing with date in next / previous row
  mutate(is_startdate = date1 != lag(date2),
         is_enddate = date2 != lead(date1)) %>% 
  ## NA's appear at the beginning (start_date) and end (end_date) and should thus be interpreted as TRUE
  replace_na(list(is_startdate = T, is_enddate = T))

## combine the start- and end-dates
date_info %>% 
  filter(is_enddate) %>% 
  mutate(date1 = date_info$date1[date_info$is_startdate]) %>% 
  select(-starts_with("is_"))

-------
# A tibble: 3 x 7
id date1      date2       var1  var2  var3  var4
<dbl> <date>     <date>     <int> <int> <int> <dbl>
1  1.00 1990-11-05 1990-11-28     2     7    12  0   
2  1.00 1990-12-01 1990-12-12     5    10    15  0   
3  1.00 1990-12-15 1990-12-31     6    11    16  1.00

Upvotes: 7

Related Questions