Reputation: 155
I have a dataset, where I have different provider
attending a site
multiple times.
I want to create columns to show start
and stop
dates where they are present.
Here's a sample dataset:
x <- tibble(
"site" = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
"provider" = c("A", "B", "C", "A", "A", "C", "C", "A", "C", "A"),
"date" = seq(from = as.Date("2019-01-01"), to = as.Date("2019-01-10"), by = 'day'),
)
I've been unable to create a bit of code that adequately captures the fact that a provider might be present several times. The best I got so far is:
x %>%
group_by(site, provider) %>%
mutate("start" = min(date),
"end" = max(date))
Which creates this:
# A tibble: 10 x 5
# Groups: site, provider [5]
site provider date start end
<dbl> <chr> <date> <date> <date>
1 1 A 2019-01-01 2019-01-01 2019-01-05
2 1 B 2019-01-02 2019-01-02 2019-01-02
3 1 C 2019-01-03 2019-01-03 2019-01-03
4 1 A 2019-01-04 2019-01-01 2019-01-05
5 1 A 2019-01-05 2019-01-01 2019-01-05
6 2 C 2019-01-06 2019-01-06 2019-01-09
7 2 C 2019-01-07 2019-01-06 2019-01-09
8 2 A 2019-01-08 2019-01-08 2019-01-10
9 2 C 2019-01-09 2019-01-06 2019-01-09
10 2 A 2019-01-10 2019-01-08 2019-01-10
However, this only process the individual provider
once per site
.
Here is what I would like the final dataset to look like:
# A tibble: 10 x 5
# Groups: site, provider [5]
site provider date start end
<dbl> <chr> <date> <date> <date>
1 1 A 2019-01-01 2019-01-01 2019-01-01 # A stops at 2019-01-01
2 1 B 2019-01-02 2019-01-02 2019-01-02
3 1 C 2019-01-03 2019-01-03 2019-01-03
4 1 A 2019-01-04 2019-01-04 2019-01-05 # A restarts from 2019-01-04
5 1 A 2019-01-05 2019-01-04 2019-01-05
6 2 C 2019-01-06 2019-01-06 2019-01-07
7 2 C 2019-01-07 2019-01-06 2019-01-07 # C stops at 2019-01-07
8 2 A 2019-01-08 2019-01-08 2019-01-08 # A stops at 2019-01-08
9 2 C 2019-01-09 2019-01-09 2019-01-09 # C restarts at 2019-01-09
10 2 A 2019-01-10 2019-01-10 2019-01-10 # A restarts at 2019-01-10
Upvotes: 1
Views: 59
Reputation: 887118
An option using only dplyr
where the 'grp' is created by comparing adjacent elements of 'provider', take the cumulative sum of the logical vector and find the min
and max
of 'date' after grouping by 'site', 'provider' and 'grp'
library(dplyr)
x %>%
group_by(site, provider,
grp = cumsum(provider != lag(provider, default = first(provider)))) %>%
mutate(start = min(date), end = max(date)) %>%
ungroup %>%
select(-grp)
# A tibble: 10 x 5
# site provider date start end
# <dbl> <chr> <date> <date> <date>
# 1 1 A 2019-01-01 2019-01-01 2019-01-01
# 2 1 B 2019-01-02 2019-01-02 2019-01-02
3 3 1 C 2019-01-03 2019-01-03 2019-01-03
# 4 1 A 2019-01-04 2019-01-04 2019-01-05
# 5 1 A 2019-01-05 2019-01-04 2019-01-05
# 6 2 C 2019-01-06 2019-01-06 2019-01-07
# 7 2 C 2019-01-07 2019-01-06 2019-01-07
# 8 2 A 2019-01-08 2019-01-08 2019-01-08
# 9 2 C 2019-01-09 2019-01-09 2019-01-09
#10 2 A 2019-01-10 2019-01-10 2019-01-10
Upvotes: 1
Reputation: 388982
Use data.table::rleid
to create an additional group
library(dplyr)
x %>%
group_by(site, provider, group = data.table::rleid(provider)) %>%
mutate(start = min(date), end = max(date)) %>%
ungroup() %>%
select(-group)
# A tibble: 10 x 5
# site provider date start end
# <dbl> <chr> <date> <date> <date>
# 1 1 A 2019-01-01 2019-01-01 2019-01-01
# 2 1 B 2019-01-02 2019-01-02 2019-01-02
# 3 1 C 2019-01-03 2019-01-03 2019-01-03
# 4 1 A 2019-01-04 2019-01-04 2019-01-05
# 5 1 A 2019-01-05 2019-01-04 2019-01-05
# 6 2 C 2019-01-06 2019-01-06 2019-01-07
# 7 2 C 2019-01-07 2019-01-06 2019-01-07
# 8 2 A 2019-01-08 2019-01-08 2019-01-08
# 9 2 C 2019-01-09 2019-01-09 2019-01-09
#10 2 A 2019-01-10 2019-01-10 2019-01-10
Upvotes: 1