Reputation: 434
So I have a data frame in the general structure below:
dataframe:
rownum | group | date |
---|---|---|
1 | a | 2021-05-01 |
2 | a | 2021-05-02 |
3 | a | 2021-05-03 |
4 | b | 2021-05-15 |
5 | b | 2021-05-17 |
6 | b | 2021-05-30 |
7 | b | 2021-05-31 |
8 | b | 2021-05-31 |
9 | c | 2021-05-01 |
10 | c | 2021-05-05 |
What I would like to do is, WITHIN GROUP, compare the first row to the next row, until the difference between the dates meets some threshold, say 10 days. Then, once that row meets the threshold, I'd like to then test the next row against the subsequent row. It would look like this:
Result, using threshold of 10:
|rownum|group |date |date diff|
|------|------|-----------|---|
|1 | a |2021-05-01 |NA|
|2 | a |2021-05-02 |1|
|3 | a |2021-05-03 |2|
|4 | b |2021-05-15 |NA|
|5 | b |2021-05-17 |2|
|6 | b |2021-05-30 |15 (meets criteria, start from row 7 now)|
|7 | b |2021-05-31 | NA|
|8 | b |2021-05-31 | 0|
|9 | c |2021-05-01 | NA|
|10 | c |2021-05-05 | 4|
So to reiterate, its comparing the FIRST row of a group to subsequent rows until some threshold is met. Then the count starts over at the first rep after that within the group to subsequent rows within the group. The difference is recorded as datediff.
I've tried this but I dont know if sapply is the way to go:
dataframe %>%
group_by(group) %>%
mutate(
datediff = sapply(date, function(x) {
all(difftime(dataframe$date,dplyr::lag(dataframe, n = 1, default = NA)))
}
)
)
Also tried this, which I think is closer to what I want:
for (m in 1:length(dataframe)) {
dataframe <- dataframe %>%
group_by(group) %>%
rowwise() %>%
mutate(datediff = difftime(dataframe$date,dplyr::lag(date, n = m, default = NA), units="days"))
}
So far I havent been able to get the right rowwise comparison to even implement the thresholding bit.
Upvotes: 3
Views: 983
Reputation: 39154
Another tidyverse
solution. We can use accumulate
to achieve this task. dat
is from r2evans' example.
library(tidyverse)
dat2 <- dat %>%
group_by(group) %>%
mutate(diff_lag = as.integer(date - lag(date))) %>%
mutate(diff = accumulate(diff_lag, function(x, y){
if (is.na(x)){
res <- y
} else if (x > 10){
res <- NA
} else {
res <- x + y
}
return(res)
})) %>%
select(-diff_lag) %>%
ungroup()
dat2
# # A tibble: 10 x 4
# rownum group date diff
# < int> <chr> <date> <int>
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
Upvotes: 3
Reputation: 160407
func <- function(x, threshold = 10) {
r <- rle(c(0, diff(x)) > threshold)
if ((len <- length(r$values)) > 1) {
r$lengths[len] <- r$lengths[len] - 1L
r$lengths[1] <- r$lengths[1] + 1L
}
cumsum(inverse.rle(r))
}
dat$group2 <- ave(as.numeric(dat$date), dat$group, FUN = func)
dat$datediff <- ave(as.numeric(dat$date), dat[,c("group", "group2")], FUN = function(x) c(NA, (x - x[1])[-1]))
dat$group2 <- NULL
dat
# rownum group date datediff
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
library(dplyr)
dat %>%
group_by(group) %>%
mutate(group2 = func(date)) %>%
group_by(group, group2) %>%
mutate(datediff = c(NA, (date - date[1])[-1])) %>%
ungroup() %>%
select(-group2)
# # A tibble: 10 x 4
# rownum group date datediff
# <int> <chr> <date> <dbl>
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
Data
dat <- structure(list(rownum = 1:10, group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"), date = structure(c(18748, 18749, 18750, 18762, 18764, 18777, 18778, 18778, 18748, 18752), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
(I already converted dat$date
to Date
-class.)
Upvotes: 1
Reputation: 7385
Here's a roundabout way of getting what you're looking for, where some of your NA
are set to 0
using this solution:
library(tidyverse)
df %>%
group_by(group) %>%
mutate(date = as.Date(date),
date_diff = date - first(date),
flag = date_diff > 10) %>%
group_by(group, flag) %>%
mutate(temp_group = cur_group_id()) %>%
group_by(temp_group) %>%
mutate(date_diff = case_when(date_diff == first(date_diff) ~ date_diff,
date_diff != first(date_diff) & date_diff < 10 ~ date - first(date),
date_diff != first(date_diff) & date_diff > 10 ~ date - nth(date, 2))) %>%
ungroup() %>%
select(group, date, date_diff)
# A tibble: 10 x 3
group date date_diff
<chr> <date> <drtn>
1 a 2021-05-01 0 days
2 a 2021-05-02 1 days
3 a 2021-05-03 2 days
4 b 2021-05-15 0 days
5 b 2021-05-17 2 days
6 b 2021-05-30 15 days
7 b 2021-05-31 0 days
8 b 2021-05-31 0 days
9 c 2021-05-01 0 days
10 c 2021-05-05 4 days
Upvotes: 0