SqueakyBeak
SqueakyBeak

Reputation: 434

How to compare within a group the first value to each subsequent value until a condition is met

So I have a data frame in the general structure below:

dataframe:

rownum group date
1 a 2021-05-01
2 a 2021-05-02
3 a 2021-05-03
4 b 2021-05-15
5 b 2021-05-17
6 b 2021-05-30
7 b 2021-05-31
8 b 2021-05-31
9 c 2021-05-01
10 c 2021-05-05

What I would like to do is, WITHIN GROUP, compare the first row to the next row, until the difference between the dates meets some threshold, say 10 days. Then, once that row meets the threshold, I'd like to then test the next row against the subsequent row. It would look like this:

Result, using threshold of 10:

|rownum|group |date       |date diff|
|------|------|-----------|---|        
|1     | a    |2021-05-01 |NA|
|2     | a    |2021-05-02 |1|
|3     | a    |2021-05-03 |2|
|4     | b    |2021-05-15 |NA|
|5     | b    |2021-05-17 |2|
|6     | b    |2021-05-30 |15 (meets criteria, start from row 7 now)|
|7     | b    |2021-05-31 | NA|
|8     | b    |2021-05-31 | 0|
|9     | c    |2021-05-01 | NA|
|10    | c    |2021-05-05 |  4|

So to reiterate, its comparing the FIRST row of a group to subsequent rows until some threshold is met. Then the count starts over at the first rep after that within the group to subsequent rows within the group. The difference is recorded as datediff.

I've tried this but I dont know if sapply is the way to go:

dataframe %>% 
    group_by(group) %>%
    mutate(
        datediff = sapply(date, function(x) {
            all(difftime(dataframe$date,dplyr::lag(dataframe, n = 1, default = NA)))
                }
        )
    )

Also tried this, which I think is closer to what I want:

for (m in 1:length(dataframe)) {
    dataframe <- dataframe %>% 
        group_by(group) %>% 
        rowwise() %>% 
        mutate(datediff = difftime(dataframe$date,dplyr::lag(date, n = m, default = NA), units="days"))
    }

So far I havent been able to get the right rowwise comparison to even implement the thresholding bit.

Upvotes: 3

Views: 983

Answers (3)

www
www

Reputation: 39154

Another tidyverse solution. We can use accumulate to achieve this task. dat is from r2evans' example.

library(tidyverse)

dat2 <- dat %>%
  group_by(group) %>%
  mutate(diff_lag = as.integer(date - lag(date))) %>%
  mutate(diff = accumulate(diff_lag, function(x, y){
    if (is.na(x)){
      res <- y
    }  else if (x > 10){
      res <- NA
    } else {
      res <- x + y
    }
    return(res)
  })) %>%
  select(-diff_lag) %>%
  ungroup()

dat2
# # A tibble: 10 x 4
#    rownum group date        diff
# <    int> <chr> <date>     <int>
#  1      1 a     2021-05-01    NA
#  2      2 a     2021-05-02     1
#  3      3 a     2021-05-03     2
#  4      4 b     2021-05-15    NA
#  5      5 b     2021-05-17     2
#  6      6 b     2021-05-30    15
#  7      7 b     2021-05-31    NA
#  8      8 b     2021-05-31     0
#  9      9 c     2021-05-01    NA
# 10     10 c     2021-05-05     4

Upvotes: 3

r2evans
r2evans

Reputation: 160407

base R

func <- function(x, threshold = 10) {
  r <- rle(c(0, diff(x)) > threshold)
  if ((len <- length(r$values)) > 1) {
    r$lengths[len] <- r$lengths[len] - 1L
    r$lengths[1] <- r$lengths[1] + 1L
  }
  cumsum(inverse.rle(r))
}
dat$group2 <- ave(as.numeric(dat$date), dat$group, FUN = func)
dat$datediff <- ave(as.numeric(dat$date), dat[,c("group", "group2")], FUN = function(x) c(NA, (x - x[1])[-1]))
dat$group2 <- NULL
dat
#    rownum group       date datediff
# 1       1     a 2021-05-01       NA
# 2       2     a 2021-05-02        1
# 3       3     a 2021-05-03        2
# 4       4     b 2021-05-15       NA
# 5       5     b 2021-05-17        2
# 6       6     b 2021-05-30       15
# 7       7     b 2021-05-31       NA
# 8       8     b 2021-05-31        0
# 9       9     c 2021-05-01       NA
# 10     10     c 2021-05-05        4

dplyr

library(dplyr)
dat %>%
  group_by(group) %>%
  mutate(group2 = func(date)) %>%
  group_by(group, group2) %>%
  mutate(datediff = c(NA, (date - date[1])[-1])) %>%
  ungroup() %>%
  select(-group2)
# # A tibble: 10 x 4
#    rownum group date       datediff
#     <int> <chr> <date>        <dbl>
#  1      1 a     2021-05-01       NA
#  2      2 a     2021-05-02        1
#  3      3 a     2021-05-03        2
#  4      4 b     2021-05-15       NA
#  5      5 b     2021-05-17        2
#  6      6 b     2021-05-30       15
#  7      7 b     2021-05-31       NA
#  8      8 b     2021-05-31        0
#  9      9 c     2021-05-01       NA
# 10     10 c     2021-05-05        4

Data

dat <- structure(list(rownum = 1:10, group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"), date = structure(c(18748, 18749, 18750, 18762, 18764, 18777, 18778, 18778, 18748, 18752), class = "Date")), row.names = c(NA, -10L), class = "data.frame")

(I already converted dat$date to Date-class.)

Upvotes: 1

Matt
Matt

Reputation: 7385

Here's a roundabout way of getting what you're looking for, where some of your NA are set to 0 using this solution:

library(tidyverse)

df %>% 
  group_by(group) %>% 
  mutate(date = as.Date(date),
         date_diff = date - first(date),
         flag = date_diff > 10) %>% 
  group_by(group, flag) %>% 
  mutate(temp_group = cur_group_id()) %>% 
  group_by(temp_group) %>% 
  mutate(date_diff = case_when(date_diff == first(date_diff) ~ date_diff,
                               date_diff != first(date_diff) & date_diff < 10 ~ date - first(date),
                               date_diff != first(date_diff) & date_diff > 10 ~ date - nth(date, 2))) %>% 
  ungroup() %>% 
  select(group, date, date_diff) 


# A tibble: 10 x 3
   group date       date_diff
   <chr> <date>     <drtn>   
 1 a     2021-05-01  0 days  
 2 a     2021-05-02  1 days  
 3 a     2021-05-03  2 days  
 4 b     2021-05-15  0 days  
 5 b     2021-05-17  2 days  
 6 b     2021-05-30 15 days  
 7 b     2021-05-31  0 days  
 8 b     2021-05-31  0 days  
 9 c     2021-05-01  0 days  
10 c     2021-05-05  4 days 

Upvotes: 0

Related Questions