Reputation: 117
I want to get rolling means for the past 1 to 10 events grouped by a column for multiple columns. I also want it very fast such as in dplyr or data.table because I want to run this on a 1,000,000 x 1,000 dataframe.
starting df
data.table(a = c("bill", "bob", "bill", "bob", "bill", "bob"),
b = c(1,2,1,1,3,2),
c = c(2,3,9,1,4,1),
d = c(4,5,1,7,3,4))
1: bill 1 2 4
2: bob 2 3 5
3: bill 1 9 1
4: bob 1 1 7
5: bill 3 4 3
6: bob 2 1 4
desired df
I want the rolling mean of only b and c grouped by column a with a window of 1 to 10 for each column lagged 1 row.
a b c d b_roll1 c_roll1 b_roll2 c_roll2 b_roll3 c_roll3
1: bill 1 2 4 NA NA NA NA NA NA
2: bob 2 3 5 NA NA NA NA NA NA
3: bill 1 9 1 1 2 1 2 1 2
4: bob 1 1 7 2 3 2 3 2 3
5: bill 3 4 3 1 9 1 5.5 1 5.5
6: bob 2 1 4 1 1 1 2 1 2
Upvotes: 0
Views: 404
Reputation: 35307
Your example outcome doesn't make too much sense to me, but here is an example on how you can generate many mutate
calls programmatically.
An extendable solution using lazyeval
and RcppRoll
:
library(tidyverse)
vars <- c('b', 'c')
ns <- 1:10
com <- expand.grid(vars, ns, stringsAsFactors = FALSE)
dots <- map2(com[[1]], com[[2]],
~lazyeval::interp(~RcppRoll::roll_meanr(x, y, fill = NA), x = as.name(.x), y = .y))
names(dots) <- apply(com, 1, paste0, collapse = '_')
D %>%
group_by(a) %>%
mutate_(.dots = dots)
Gives:
Source: local data frame [6 x 24]
Groups: a [2]
a b c d `b_ 1` `c_ 1` `b_ 2` `c_ 2` `b_ 3` `c_ 3` `b_ 4` `c_ 4` `b_ 5` `c_ 5` `b_ 6` `c_ 6` `b_ 7` `c_ 7` `b_ 8` `c_ 8` `b_ 9`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bill 1 2 4 1 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 bob 2 3 5 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 bill 1 2 1 1 2 1.0 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 bob 1 1 7 1 1 1.5 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 bill 3 4 3 3 4 2.0 3 1.666667 2.666667 NA NA NA NA NA NA NA NA NA NA NA
6 bob 2 1 4 2 1 1.5 1 1.666667 1.666667 NA NA NA NA NA NA NA NA NA NA NA
# ... with 3 more variables: `c_ 9` <dbl>, b_10 <dbl>, c_10 <dbl>
Upvotes: 2
Reputation: 3242
I am still not completely following you. It seems that you apply a combination of a lag and a rolled mean. For just the rolled mean this is a solution using dplyr
and RcppRoll
.
roll_mean_na <- function(x, lag){
c(rep(NA, lag - 1), RcppRoll::roll_mean(x, lag, align = "left"))
}
library(dplyr)
df %>% group_by(a) %>%
mutate(b_2 = roll_mean_na(b, 2), c_2 = roll_mean_na(c, 2),
b_3 = roll_mean_na(b, 3), c_3 = roll_mean_na(c, 3),
b_4 = roll_mean_na(b, 4), c_4 = roll_mean_na(c, 4))
Upvotes: 1