SorenK
SorenK

Reputation: 155

Assign multiple start and stop dates for single variable in group

I have a dataset, where I have different provider attending a site multiple times.

I want to create columns to show start and stop dates where they are present.

Here's a sample dataset:

x <- tibble(
  "site" = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  "provider" = c("A", "B", "C", "A", "A", "C", "C", "A", "C", "A"),
  "date" = seq(from = as.Date("2019-01-01"), to = as.Date("2019-01-10"), by = 'day'),
)

I've been unable to create a bit of code that adequately captures the fact that a provider might be present several times. The best I got so far is:

x %>%
  group_by(site, provider) %>%
  mutate("start" = min(date),
         "end" = max(date))

Which creates this:

# A tibble: 10 x 5
# Groups:   site, provider [5]
    site provider date       start      end       
   <dbl> <chr>    <date>     <date>     <date>    
 1     1 A        2019-01-01 2019-01-01 2019-01-05
 2     1 B        2019-01-02 2019-01-02 2019-01-02
 3     1 C        2019-01-03 2019-01-03 2019-01-03
 4     1 A        2019-01-04 2019-01-01 2019-01-05
 5     1 A        2019-01-05 2019-01-01 2019-01-05
 6     2 C        2019-01-06 2019-01-06 2019-01-09
 7     2 C        2019-01-07 2019-01-06 2019-01-09
 8     2 A        2019-01-08 2019-01-08 2019-01-10
 9     2 C        2019-01-09 2019-01-06 2019-01-09
10     2 A        2019-01-10 2019-01-08 2019-01-10

However, this only process the individual provider once per site.

Here is what I would like the final dataset to look like:

# A tibble: 10 x 5
# Groups:   site, provider [5]
    site provider date       start      end       
   <dbl> <chr>    <date>     <date>     <date>    
 1     1 A        2019-01-01 2019-01-01 2019-01-01  # A stops at 2019-01-01
 2     1 B        2019-01-02 2019-01-02 2019-01-02
 3     1 C        2019-01-03 2019-01-03 2019-01-03
 4     1 A        2019-01-04 2019-01-04 2019-01-05  # A restarts from 2019-01-04
 5     1 A        2019-01-05 2019-01-04 2019-01-05
 6     2 C        2019-01-06 2019-01-06 2019-01-07
 7     2 C        2019-01-07 2019-01-06 2019-01-07  # C stops at 2019-01-07
 8     2 A        2019-01-08 2019-01-08 2019-01-08  # A stops at 2019-01-08
 9     2 C        2019-01-09 2019-01-09 2019-01-09  # C restarts at 2019-01-09
10     2 A        2019-01-10 2019-01-10 2019-01-10  # A restarts at 2019-01-10

Upvotes: 1

Views: 59

Answers (2)

akrun
akrun

Reputation: 887118

An option using only dplyr where the 'grp' is created by comparing adjacent elements of 'provider', take the cumulative sum of the logical vector and find the min and max of 'date' after grouping by 'site', 'provider' and 'grp'

library(dplyr)
x %>%
   group_by(site, provider, 
      grp = cumsum(provider != lag(provider, default = first(provider))))  %>% 
   mutate(start = min(date), end = max(date)) %>% 
   ungroup %>%
   select(-grp)
# A tibble: 10 x 5
#    site provider date       start      end       
#   <dbl> <chr>    <date>     <date>     <date>    
# 1     1 A        2019-01-01 2019-01-01 2019-01-01
# 2     1 B        2019-01-02 2019-01-02 2019-01-02
3 3     1 C        2019-01-03 2019-01-03 2019-01-03
# 4     1 A        2019-01-04 2019-01-04 2019-01-05
# 5     1 A        2019-01-05 2019-01-04 2019-01-05
# 6     2 C        2019-01-06 2019-01-06 2019-01-07
# 7     2 C        2019-01-07 2019-01-06 2019-01-07
# 8     2 A        2019-01-08 2019-01-08 2019-01-08
# 9     2 C        2019-01-09 2019-01-09 2019-01-09
#10     2 A        2019-01-10 2019-01-10 2019-01-10

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

Use data.table::rleid to create an additional group

library(dplyr)

x %>%
  group_by(site, provider, group = data.table::rleid(provider)) %>%
  mutate(start = min(date), end = max(date)) %>%
  ungroup() %>%
  select(-group)

 # A tibble: 10 x 5
#    site provider date       start      end       
#   <dbl> <chr>    <date>     <date>     <date>    
# 1     1 A        2019-01-01 2019-01-01 2019-01-01
# 2     1 B        2019-01-02 2019-01-02 2019-01-02
# 3     1 C        2019-01-03 2019-01-03 2019-01-03
# 4     1 A        2019-01-04 2019-01-04 2019-01-05
# 5     1 A        2019-01-05 2019-01-04 2019-01-05
# 6     2 C        2019-01-06 2019-01-06 2019-01-07
# 7     2 C        2019-01-07 2019-01-06 2019-01-07
# 8     2 A        2019-01-08 2019-01-08 2019-01-08
# 9     2 C        2019-01-09 2019-01-09 2019-01-09
#10     2 A        2019-01-10 2019-01-10 2019-01-10

Upvotes: 1

Related Questions