SarahDew
SarahDew

Reputation: 444

Select start from first row and end from last row by group in R

My data has repeated calculations even when nothing has changed. I would like to group these rows, by getting the start date from the first occurrence of the value and the end date for the last consecutive (!) occurrence of the value. When the value occurs again but there was another value in between, this should be considered a new group.

Data:

ID value start end
A 1 2020-01-01 2020-03-01
A 1 2020-03-01 2020-04-01
A 1 2020-06-03 2020-05-01
A 2 2020-06-03 2020-07-04
A 1 2020-07-04

Desired result:

ID value start end
A 1 2020-01-01 2020-05-01
A 2 2020-06-03 2020-07-04
A 1 2020-07-04
#Example data
ID <- c('A','A','A','A','A', 'B', 'B', 'B', 'B')
value <- c(1,1,1,2,1,2,2,3,3)
start <- as.Date(c('2020-1-1','2020-3-1','2020-4-1','2020-6-3','2020-7-4','2020-2-2','2020-3-3','2020-4-4','2020-5-5'))
end <- as.Date(c('2020-3-1','2020-4-1','2020-5-1','2020-7-4',NA,'2020-3-3','2020-4-4','2020-5-5',NA))

my.data <- data.frame(ID, value, start, end)

#result
ID <- c('A','A','A', 'B', 'B')
value <- c(1,2,1,2,3)
start <- as.Date(c('2020-1-1','2020-6-3','2020-7-4','2020-2-2','2020-4-4'))
end <- as.Date(c('2020-5-1','2020-7-4',NA,'2020-4-4',NA))

my.result <- data.frame(ID, value, start, end)

Upvotes: 0

Views: 161

Answers (1)

ktiu
ktiu

Reputation: 2626

Here is my approach with dplyr and rle():

library(dplyr)

my.data %>%
  mutate(rlid = rep(seq_along(rle(value)$values), rle(value)$lengths)) %>%
  group_by(rlid, ID) %>%
  summarize(value = first(value), start = min(start), end = max(end)) %>%
  ungroup() %>%
  select(-rlid)

Returns:

# A tibble: 5 x 4
  ID    value start      end       
  <chr> <dbl> <date>     <date>    
1 A         1 2020-01-01 2020-05-01
2 A         2 2020-06-03 2020-07-04
3 A         1 2020-07-04 NA        
4 B         2 2020-02-02 2020-04-04
5 B         3 2020-04-04 NA        

(Data used)

my.data <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), value = c(1, 1, 1, 2, 1, 2, 2, 3, 3), start = structure(c(18262, 18322, 18353, 18416, 18447, 18294, 18324, 18356, 18387), class = "Date"), end = structure(c(18322, 18353, 18383, 18447, NA, 18324, 18356, 18387, NA), class = "Date")), class = "data.frame", row.names = c(NA, -9L))

Upvotes: 1

Related Questions