Reputation: 444
My data has repeated calculations even when nothing has changed. I would like to group these rows, by getting the start date from the first occurrence of the value and the end date for the last consecutive (!) occurrence of the value. When the value occurs again but there was another value in between, this should be considered a new group.
Data:
ID | value | start | end |
---|---|---|---|
A | 1 | 2020-01-01 | 2020-03-01 |
A | 1 | 2020-03-01 | 2020-04-01 |
A | 1 | 2020-06-03 | 2020-05-01 |
A | 2 | 2020-06-03 | 2020-07-04 |
A | 1 | 2020-07-04 |
Desired result:
ID | value | start | end |
---|---|---|---|
A | 1 | 2020-01-01 | 2020-05-01 |
A | 2 | 2020-06-03 | 2020-07-04 |
A | 1 | 2020-07-04 |
#Example data
ID <- c('A','A','A','A','A', 'B', 'B', 'B', 'B')
value <- c(1,1,1,2,1,2,2,3,3)
start <- as.Date(c('2020-1-1','2020-3-1','2020-4-1','2020-6-3','2020-7-4','2020-2-2','2020-3-3','2020-4-4','2020-5-5'))
end <- as.Date(c('2020-3-1','2020-4-1','2020-5-1','2020-7-4',NA,'2020-3-3','2020-4-4','2020-5-5',NA))
my.data <- data.frame(ID, value, start, end)
#result
ID <- c('A','A','A', 'B', 'B')
value <- c(1,2,1,2,3)
start <- as.Date(c('2020-1-1','2020-6-3','2020-7-4','2020-2-2','2020-4-4'))
end <- as.Date(c('2020-5-1','2020-7-4',NA,'2020-4-4',NA))
my.result <- data.frame(ID, value, start, end)
Upvotes: 0
Views: 161
Reputation: 2626
Here is my approach with dplyr
and rle()
:
library(dplyr)
my.data %>%
mutate(rlid = rep(seq_along(rle(value)$values), rle(value)$lengths)) %>%
group_by(rlid, ID) %>%
summarize(value = first(value), start = min(start), end = max(end)) %>%
ungroup() %>%
select(-rlid)
Returns:
# A tibble: 5 x 4
ID value start end
<chr> <dbl> <date> <date>
1 A 1 2020-01-01 2020-05-01
2 A 2 2020-06-03 2020-07-04
3 A 1 2020-07-04 NA
4 B 2 2020-02-02 2020-04-04
5 B 3 2020-04-04 NA
(Data used)
my.data <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), value = c(1, 1, 1, 2, 1, 2, 2, 3, 3), start = structure(c(18262, 18322, 18353, 18416, 18447, 18294, 18324, 18356, 18387), class = "Date"), end = structure(c(18322, 18353, 18383, 18447, NA, 18324, 18356, 18387, NA), class = "Date")), class = "data.frame", row.names = c(NA, -9L))
Upvotes: 1