Reputation: 91
I have the following sample data:
require(tibble)

sample_data <- tibble(
  emp_name = c("john", "john", "john", "john", "john", "john", "john"),
  task = c("carpenter", "carpenter", "carpenter", "painter", "painter", "carpenter", "carpenter"),
  date_stamp = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-07", "2019-01-08", "2019-01-30", "2019-02-02")
)
I need to aggregate this data into intervals based on the dates.
The rule: if the next date_stamp listed for the same attribute is the following day (no date in between), the rows should be aggregated into one interval. Otherwise, date_stamp_from and date_stamp_to should both equal date_stamp.
desired_result <- tibble(
  emp_name = c("john", "john", "john", "john"),
  task = c("carpenter", "painter", "carpenter", "carpenter"),
  date_stamp_from = c("2019-01-01", "2019-01-07", "2019-01-30", "2019-02-02"),
  date_stamp_to = c("2019-01-03", "2019-01-08", "2019-01-30", "2019-02-02"),
  count_dates = c(3, 2, 1, 1)
)
What would be the most efficient way to solve this? The original dataset has about 10,000 records.
Upvotes: 1
Views: 157
Reputation: 388817
We can use diff and cumsum to create groups, then take the first and last date_stamp and the number of rows in each group.
library(dplyr)

sample_data %>%
  mutate(date_stamp = as.Date(date_stamp)) %>%
  # start a new group whenever the gap to the previous row exceeds one day
  group_by(gr = cumsum(c(TRUE, diff(date_stamp) > 1))) %>%
  mutate(date_stamp_from = first(date_stamp),
         date_stamp_to = last(date_stamp),
         count_dates = n()) %>%
  # keep a single row per group
  slice(1L) %>%
  ungroup() %>%
  select(-gr, -date_stamp)
# A tibble: 4 x 5
#  emp_name task      date_stamp_from date_stamp_to count_dates
#  <chr>    <chr>     <date>          <date>              <int>
#1 john     carpenter 2019-01-01      2019-01-03              3
#2 john     painter   2019-01-07      2019-01-08              2
#3 john     carpenter 2019-01-30      2019-01-30              1
#4 john     carpenter 2019-02-02      2019-02-02              1
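To see why this works: diff gives the gap in days between neighbouring rows, and cumsum over "gap > 1" increases the group id each time a run of consecutive days breaks. Note that the pipeline above compares neighbouring rows of the whole table, so it seems to rely on the rows being sorted by date and on task changes coinciding with date gaps, as they do in the sample. The sketch below is a possible variant, not the answer's original method. It assumes that "same attribute" in the question means the same emp_name and task, builds the runs separately within each such pair, and uses summarise (with the .groups argument, so dplyr 1.0.0 or later is assumed) instead of mutate plus slice.

```r
library(dplyr)
library(tibble)

sample_data <- tibble(
  emp_name = rep("john", 7),
  task = c("carpenter", "carpenter", "carpenter", "painter", "painter",
           "carpenter", "carpenter"),
  date_stamp = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-07",
                 "2019-01-08", "2019-01-30", "2019-02-02")
)

# How the group id is built: diff() gives the gaps in days between neighbours,
# and cumsum() bumps the id whenever a gap is larger than one day.
d <- as.Date(sample_data$date_stamp)
gaps <- as.integer(diff(d))          # 1 1 4 1 22 3
run_id <- cumsum(c(TRUE, gaps > 1))  # 1 1 1 2 2 3 4

# Variant (assumption): form the runs separately per emp_name and task.
result <- sample_data %>%
  mutate(date_stamp = as.Date(date_stamp)) %>%
  arrange(emp_name, task, date_stamp) %>%
  group_by(emp_name, task) %>%
  mutate(gr = cumsum(c(TRUE, diff(date_stamp) > 1))) %>%
  group_by(emp_name, task, gr) %>%
  summarise(date_stamp_from = min(date_stamp),
            date_stamp_to = max(date_stamp),
            count_dates = n(),
            .groups = "drop") %>%
  select(-gr) %>%
  arrange(date_stamp_from)

print(result)
```

On the sample data this variant should give the same four intervals as the desired result. With about 10,000 rows, either version is a single vectorised pass per group, so performance is unlikely to be a concern.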
Upvotes: 2