Select start from first row and end from last row by group in R

Question

My data has repeated calculations even when nothing has changed. I would like to group these rows, by getting the start date from the first occurrence of the value and the end date for the last consecutive (!) occurrence of the value. When the value occurs again but there was another value in between, this should be considered a new group.

Data:

ID	value	start	end
A	1	2020-01-01	2020-03-01
A	1	2020-03-01	2020-04-01
A	1	2020-06-03	2020-05-01
A	2	2020-06-03	2020-07-04
A	1	2020-07-04

Desired result:

ID	value	start	end
A	1	2020-01-01	2020-05-01
A	2	2020-06-03	2020-07-04
A	1	2020-07-04

#Example data
ID <- c('A','A','A','A','A', 'B', 'B', 'B', 'B')
value <- c(1,1,1,2,1,2,2,3,3)
start <- as.Date(c('2020-1-1','2020-3-1','2020-4-1','2020-6-3','2020-7-4','2020-2-2','2020-3-3','2020-4-4','2020-5-5'))
end <- as.Date(c('2020-3-1','2020-4-1','2020-5-1','2020-7-4',NA,'2020-3-3','2020-4-4','2020-5-5',NA))

my.data <- data.frame(ID, value, start, end)

#result
ID <- c('A','A','A', 'B', 'B')
value <- c(1,2,1,2,3)
start <- as.Date(c('2020-1-1','2020-6-3','2020-7-4','2020-2-2','2020-4-4'))
end <- as.Date(c('2020-5-1','2020-7-4',NA,'2020-4-4',NA))

my.result <- data.frame(ID, value, start, end)

ktiu · Accepted Answer

Here is my approach with dplyr and rle():

library(dplyr)

my.data %>%
  mutate(rlid = rep(seq_along(rle(value)$values), rle(value)$lengths)) %>%
  group_by(rlid, ID) %>%
  summarize(value = first(value), start = min(start), end = max(end)) %>%
  ungroup() %>%
  select(-rlid)

Returns:

# A tibble: 5 x 4
  ID    value start      end       
             
1 A         1 2020-01-01 2020-05-01
2 A         2 2020-06-03 2020-07-04
3 A         1 2020-07-04 NA        
4 B         2 2020-02-02 2020-04-04
5 B         3 2020-04-04 NA

(Data used)

my.data <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), value = c(1, 1, 1, 2, 1, 2, 2, 3, 3), start = structure(c(18262, 18322, 18353, 18416, 18447, 18294, 18324, 18356, 18387), class = "Date"), end = structure(c(18322, 18353, 18383, 18447, NA, 18324, 18356, 18387, NA), class = "Date")), class = "data.frame", row.names = c(NA, -9L))

Select start from first row and end from last row by group in R

Answers (1)

Related Questions