Reputation: 131
Take the following example data set
structure(list(id = c("AA2A", "AA2A", "AA2A", "AA2A", "AA2A",
"AA2A", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H"), hr = c(31L,
31L, 31L, 0L, 0L, 50L, 50L, 0L, 0L, 0L, 0L, 84L)), class = "data.frame", row.names = c(NA,
-12L))
I want to add a column which counts the number of occurences of hr>0 in a rolling window, for each id. If we set the window to three, this is thus the desired output:
I have among other tried the following - without success - using the slider package:
mutate(result = slide_dbl(hr,length(hr>0), .before = 3, .after = 0, .complete = FALSE))
mutate(result = slide_dbl(hr,summarise_if(hr>0), .before = 3, .after = 0, .complete = FALSE))
And code like this using the zoo package.
mutate(result=rollapply(hr, width=3, FUN=function(x) length(hr>0), partial=TRUE, align='left'))
The full data set has 1 million rows and 10 columns, and the last code is definitely taking too long.
I have previously used slide_dbl() without issues, when calculating means and medians, and it also works with just length
, but not when I want to add the condition hr>0
.
(The "per id" is not an issue - I normally apply group_by(id)%>%
. I'm working in tidyverse
.)
Upvotes: 2
Views: 93
Reputation: 18712
row_number()
starts at 1 for each by group. So we can use that to construct a sequence of positive indices to create a window and subset hr
:
library(dplyr)
library(purrr)
df |>
mutate(result = map_dbl(row_number(), \(i) sum(hr[keep(seq(i, i - 2), ~ .x > 0)] > 0)), .by = id)
You could write a small helper function to only return positive indices:
library(dplyr)
library(purrr)
pos_seq <- \(start, stop) with(list(s = seq(start, stop)), s[s > 0])
df |>
mutate(result = map_dbl(row_number(), \(i) sum(hr[pos_seq(i, i - 2)] > 0)), .by = id)
Benchmark
Since speed is a factor in your case I would highly recommend not using my answer as the other answers are significantly faster (particularly the answers given by Andre Wildberg and M--):
microbenchmark::microbenchmark(
`Andre Wildberg` = {
dat %>%
mutate(result = rowSums(sapply(0:2, \(x) lag(hr, x) > 0), na.rm=T), .by = id)
},
LMc = {
dat |>
mutate(result = map_dbl(row_number(), \(i) sum(hr[pos_seq(i, i - 2)] > 0)), .by = id)
},
Carl = {
dat |>
mutate(
result = slide_dbl(hr, \(x) sum(x > 0), .before = 2),
.by = id
)
},
`M--` = {
dat %>%
mutate(result = ifelse(row_number() < k,
cumsum(hr > 0),
zoo::rollsum(hr > 0, k, align = "right", fill = NA)),
.by = id)
},
setup = {
k <- 3
dat <- df[sample(seq(nrow(df)), 1E4, replace = T),]
}
)
Unit: milliseconds
expr min lq mean median uq max neval cld
Andre Wildberg 1.844201 2.094402 2.307445 2.229401 2.465251 3.541401 100 a
LMc 130.597801 137.079901 142.593837 140.629501 145.033951 268.651901 100 b
Carl 8.726101 9.211402 10.584763 9.575051 10.186901 29.928601 100 c
M-- 3.181400 3.500501 4.239822 3.659551 4.090751 16.771800 100 a
Note: one issue with this benchmark is it has a few large groups rather than a lot of small groups.
Upvotes: 1
Reputation: 29153
k <- 3
df1 %>%
mutate(result = ifelse(row_number() < k,
cumsum(hr > 0),
zoo::rollsum(hr > 0, k, align = "right", fill = NA)),
.by = id)
#> id hr result
#> 1 AA2A 31 1
#> 2 AA2A 31 2
#> 3 AA2A 31 3
#> 4 AA2A 0 2
#> 5 AA2A 0 1
#> 6 AA2A 50 1
#> 7 XX7H 50 1
#> 8 XX7H 0 1
#> 9 XX7H 0 1
#> 10 XX7H 0 0
#> 11 XX7H 0 0
#> 12 XX7H 84 1
Upvotes: 1
Reputation: 19191
An approach that creates 3 lag
variables and then rowSums
to get the desired values
library(dplyr)
df %>%
mutate(result = rowSums(sapply(0:2, \(x) lag(hr, x) > 0), na.rm=T), .by = id)
output
id hr result
1 AA2A 31 1
2 AA2A 31 2
3 AA2A 31 3
4 AA2A 0 2
5 AA2A 0 1
6 AA2A 50 1
7 XX7H 50 1
8 XX7H 0 1
9 XX7H 0 1
10 XX7H 0 0
11 XX7H 0 0
12 XX7H 84 1
Upvotes: 2
Reputation: 7540
This gives the desired result using an anonymous function:
(x > 0 = TRUE
equates to 1, so may be summed. And .before = 2
plus the current row gives the window of 3.)
library(dplyr)
library(slider)
df <- structure(list(id = c(
"AA2A", "AA2A", "AA2A", "AA2A", "AA2A",
"AA2A", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H", "XX7H"
), hr = c(
31L,
31L, 31L, 0L, 0L, 50L, 50L, 0L, 0L, 0L, 0L, 84L
)), class = "data.frame", row.names = c(
NA,
-12L
))
df |>
mutate(
result = slide_dbl(hr, \(x) sum(x > 0), .before = 2),
.by = id
)
#> id hr result
#> 1 AA2A 31 1
#> 2 AA2A 31 2
#> 3 AA2A 31 3
#> 4 AA2A 0 2
#> 5 AA2A 0 1
#> 6 AA2A 50 1
#> 7 XX7H 50 1
#> 8 XX7H 0 1
#> 9 XX7H 0 1
#> 10 XX7H 0 0
#> 11 XX7H 0 0
#> 12 XX7H 84 1
Created on 2024-05-31 with reprex v2.1.0
Upvotes: 4