Bettina
Bettina

Reputation: 131

How to conditionally count no. of occurrences in sliding window?

Take the following example data set

structure(list(id = c("AA2A", "AA2A", "AA2A", "AA2A", "AA2A", 
"AA2A", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H"), hr = c(31L, 
31L, 31L, 0L, 0L, 50L, 50L, 0L, 0L, 0L, 0L, 84L)), class = "data.frame", row.names = c(NA, 
-12L))

I want to add a column which counts the number of occurences of hr>0 in a rolling window, for each id. If we set the window to three, this is thus the desired output:

desired output

I have among other tried the following - without success - using the slider package:

mutate(result = slide_dbl(hr,length(hr>0), .before = 3, .after = 0, .complete = FALSE))
mutate(result = slide_dbl(hr,summarise_if(hr>0), .before = 3, .after = 0, .complete = FALSE))

And code like this using the zoo package.

mutate(result=rollapply(hr, width=3, FUN=function(x) length(hr>0), partial=TRUE, align='left'))

The full data set has 1 million rows and 10 columns, and the last code is definitely taking too long.

I have previously used slide_dbl() without issues, when calculating means and medians, and it also works with just length, but not when I want to add the condition hr>0.

(The "per id" is not an issue - I normally apply group_by(id)%>%. I'm working in tidyverse.)

Upvotes: 2

Views: 93

Answers (4)

LMc
LMc

Reputation: 18712

row_number() starts at 1 for each by group. So we can use that to construct a sequence of positive indices to create a window and subset hr:

library(dplyr)
library(purrr)

df |>
  mutate(result = map_dbl(row_number(), \(i) sum(hr[keep(seq(i, i - 2), ~ .x > 0)] > 0)), .by = id)

You could write a small helper function to only return positive indices:

library(dplyr)
library(purrr)

pos_seq <- \(start, stop) with(list(s = seq(start, stop)), s[s > 0])

df |>
  mutate(result = map_dbl(row_number(), \(i) sum(hr[pos_seq(i, i - 2)] > 0)), .by = id)

Benchmark

Since speed is a factor in your case I would highly recommend not using my answer as the other answers are significantly faster (particularly the answers given by Andre Wildberg and M--):

microbenchmark::microbenchmark(
  `Andre Wildberg` = {
    dat %>% 
      mutate(result = rowSums(sapply(0:2, \(x) lag(hr, x) > 0), na.rm=T), .by = id)
  },
  LMc = {
    dat |>
      mutate(result = map_dbl(row_number(), \(i) sum(hr[pos_seq(i, i - 2)] > 0)), .by = id)
  },
  Carl = {
    dat |>
      mutate(
        result = slide_dbl(hr, \(x) sum(x > 0), .before = 2),
        .by = id
      )
  },
  `M--` = { 
    
    dat %>% 
    mutate(result = ifelse(row_number() < k,
                           cumsum(hr > 0),
                           zoo::rollsum(hr > 0, k, align = "right", fill = NA)), 
           .by = id)
  },
  setup = {
    k <- 3
    dat <- df[sample(seq(nrow(df)), 1E4, replace = T),]
  }
)

Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval cld
 Andre Wildberg   1.844201   2.094402   2.307445   2.229401   2.465251   3.541401   100 a  
            LMc 130.597801 137.079901 142.593837 140.629501 145.033951 268.651901   100  b 
           Carl   8.726101   9.211402  10.584763   9.575051  10.186901  29.928601   100   c
            M--   3.181400   3.500501   4.239822   3.659551   4.090751  16.771800   100 a  

Note: one issue with this benchmark is it has a few large groups rather than a lot of small groups.

Upvotes: 1

M--
M--

Reputation: 29153

k <- 3

df1 %>% 
  mutate(result = ifelse(row_number() < k,
                         cumsum(hr > 0),
                         zoo::rollsum(hr > 0, k, align = "right", fill = NA)), 
         .by = id)
#>      id hr result
#> 1  AA2A 31      1
#> 2  AA2A 31      2
#> 3  AA2A 31      3
#> 4  AA2A  0      2
#> 5  AA2A  0      1
#> 6  AA2A 50      1
#> 7  XX7H 50      1
#> 8  XX7H  0      1
#> 9  XX7H  0      1
#> 10 XX7H  0      0
#> 11 XX7H  0      0
#> 12 XX7H 84      1

Upvotes: 1

Andre Wildberg
Andre Wildberg

Reputation: 19191

An approach that creates 3 lag variables and then rowSums to get the desired values

library(dplyr)

df %>% 
  mutate(result = rowSums(sapply(0:2, \(x) lag(hr, x) > 0), na.rm=T), .by = id)

output

     id hr result
1  AA2A 31      1
2  AA2A 31      2
3  AA2A 31      3
4  AA2A  0      2
5  AA2A  0      1
6  AA2A 50      1
7  XX7H 50      1
8  XX7H  0      1
9  XX7H  0      1
10 XX7H  0      0
11 XX7H  0      0
12 XX7H 84      1

Upvotes: 2

Carl
Carl

Reputation: 7540

This gives the desired result using an anonymous function:

(x > 0 = TRUE equates to 1, so may be summed. And .before = 2 plus the current row gives the window of 3.)

library(dplyr)
library(slider)

df <- structure(list(id = c(
  "AA2A", "AA2A", "AA2A", "AA2A", "AA2A",
  "AA2A", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H"
), hr = c(
  31L,
  31L, 31L, 0L, 0L, 50L, 50L, 0L, 0L, 0L, 0L, 84L
)), class = "data.frame", row.names = c(
  NA,
  -12L
))

df |>
  mutate(
    result = slide_dbl(hr, \(x) sum(x > 0), .before = 2),
    .by = id
  )
#>      id hr result
#> 1  AA2A 31      1
#> 2  AA2A 31      2
#> 3  AA2A 31      3
#> 4  AA2A  0      2
#> 5  AA2A  0      1
#> 6  AA2A 50      1
#> 7  XX7H 50      1
#> 8  XX7H  0      1
#> 9  XX7H  0      1
#> 10 XX7H  0      0
#> 11 XX7H  0      0
#> 12 XX7H 84      1

Created on 2024-05-31 with reprex v2.1.0

Upvotes: 4

Related Questions