Reputation: 2783
I am trying to find sequences in a list of dates and convert them to a start and an end date.
An example of my data looks as follows:
id date
1 1 2020-01-01
2 1 2020-01-02
3 1 2020-01-03
4 1 2020-01-06
5 1 2020-01-07
6 2 2020-01-02
7 2 2020-01-03
8 2 2020-01-04
9 2 2020-01-05
10 3 2020-01-04
11 3 2020-01-07
What I would like to create is the following table:
id start_date end_date
1 1 2020-01-01 2020-01-03
2 1 2020-01-06 2020-01-07
3 2 2020-01-02 2020-01-05
4 3 2020-01-04 2020-01-04
5 3 2020-01-07 2020-01-07
I have been fiddling around with the diff function but I can't quite get it to work the way I want.
Upvotes: 1
Views: 135
Reputation: 2777
A one-liner (though not really different from the other answers):
library(data.table)
dt <- data.table(
id = c(1,1,1,1,1,2,2,2,2,3,3),
date = lubridate::ymd('2020-01-01')+c(0:2,5,6,1:4,3,6))
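# calculate: a run starts right after each gap and ends right before each gap
# (x is assigned while building start and reused for end)
dt[, .(start = date[c(TRUE, x <- diff(date) != 1)], end = date[c(x, TRUE)]), id]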
#> id start end
#> 1: 1 2020-01-01 2020-01-03
#> 2: 1 2020-01-06 2020-01-07
#> 3: 2 2020-01-02 2020-01-05
#> 4: 3 2020-01-04 2020-01-04
#> 5: 3 2020-01-07 2020-01-07
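To make the indexing trick in the one-liner easier to follow, here is a minimal base R sketch for a single id (the dates of id 1 from the question). The names brk, is_start, is_end and runs are chosen for illustration only and do not appear in the answer above.

```r
# Dates for id 1, taken from the question
date <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                  "2020-01-06", "2020-01-07"))

# TRUE wherever the gap to the previous date is not exactly one day
brk <- as.numeric(diff(date)) != 1

# A run starts at the first element and right after every break
is_start <- c(TRUE, brk)
# A run ends right before every break and at the last element
is_end <- c(brk, TRUE)

runs <- data.frame(start = date[is_start], end = date[is_end])
runs
```

Both masks always contain the same number of TRUE values, so the start and end columns line up row by row. The sketch assumes the dates are sorted and free of duplicates.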
Upvotes: 0
Reputation: 73437
A base R approach using by and rle.
res <- do.call(rbind, by(DF, DF$id, function(x) {
cbind(id=x[1,1], setNames(
do.call(rbind, Map(function(i, j) data.frame(i, i + j),
x[c(0, diff(x[,2])) != 1, 2],
rle(cumsum(c(0, diff(x[,2])) != 1))$lengths - 1
)), c("start", "end")))
}))
res
# id start end
# 1.1 1 2020-01-01 2020-01-03
# 1.2 1 2020-01-06 2020-01-07
# 2 2 2020-01-02 2020-01-05
# 3.1 3 2020-01-04 2020-01-04
# 3.2 3 2020-01-07 2020-01-07
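If the nested calls above are hard to read, the same idea can probably be spelled out in separate steps. The following is only a sketch, not the code of this answer: it swaps rle for a run counter built with ave and cumsum, and the names dates_df, grp and res2 are hypothetical. It assumes the dates are sorted within each id.

```r
# Self-contained copy of the example data (hypothetical name: dates_df)
dates_df <- data.frame(
  id   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  date = as.Date("2020-01-01") + c(0:2, 5, 6, 1:4, 3, 6)
)

# Run counter within each id: increases whenever the gap to the
# previous date is not exactly one day (the first row always starts a run)
dates_df$grp <- ave(as.numeric(dates_df$date), dates_df$id,
                    FUN = function(d) cumsum(c(0, diff(d)) != 1))

# One output row per (id, grp) combination that actually occurs
pieces <- split(dates_df, list(dates_df$id, dates_df$grp), drop = TRUE)
res2 <- do.call(rbind, lapply(pieces, function(x) {
  data.frame(id = x$id[1], start = min(x$date), end = max(x$date))
}))

# split() orders by grp first, so restore the id/start ordering
res2 <- res2[order(res2$id, res2$start), ]
rownames(res2) <- NULL
res2
```

Keeping the run counter as its own column makes it easy to inspect which rows were grouped together before collapsing them.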
Data:
DF <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L), date = structure(c(18262, 18263, 18264, 18267, 18268, 18263,
18264, 18265, 18266, 18265, 18268), class = "Date")), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"), class = "data.frame")
Upvotes: 1
Reputation: 40161
One dplyr option could be:
library(dplyr)
df %>%
group_by(id) %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
group_by(id, grp = cumsum(c(0, !diff(date) %in% 0:1))) %>%
summarise(start_date = min(date),
end_date = max(date))
id grp start_date end_date
<int> <dbl> <date> <date>
1 1 0 2020-01-01 2020-01-03
2 1 1 2020-01-06 2020-01-07
3 2 0 2020-01-02 2020-01-05
4 3 0 2020-01-04 2020-01-04
5 3 1 2020-01-07 2020-01-07
Upvotes: 0
Reputation: 33548
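library(data.table)
# grp increases by one whenever the gap to the previous date within an id exceeds one day
DT[, grp := cumsum(date - shift(date, 1L, fill = date[1]) > 1), by = id]
# first and last date of each run, then drop the helper column
DT[, .(start_date = date[1], end_date = date[.N]), by = .(id, grp)][, !"grp"]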
# id start_date end_date
# 1: 1 2020-01-01 2020-01-03
# 2: 1 2020-01-06 2020-01-07
# 3: 2 2020-01-02 2020-01-05
# 4: 3 2020-01-04 2020-01-04
# 5: 3 2020-01-07 2020-01-07
Reproducible data
DT <- data.table(
id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
date = structure(
c(18262, 18263, 18264, 18267, 18268, 18263, 18264, 18265, 18266, 18265, 18268),
class = "Date"
)
)
Upvotes: 1