Reputation: 151
I would like to get rolling average for each of the numeric variables that I have. Using data.table package, I know how to compute for a single variable. But how should I revise the code so it can process multiple variables at a time rather than revising the variable name and repeat this procedure for several times? Thanks.
Suppose I have other numeric variables named as "V2", "V3", and "V4".
require(data.table)
setDT(data)
setkey(data,Receptor,date)
data[ , `:=` ('RollConc' = rollmean(AvgConc, 48, align="left", na.pad=TRUE)) , by=Receptor]
A copy of my sample data can be found at: https://drive.google.com/file/d/0B86_a8ltyoL3OE9KTUstYmRRbFk/view?usp=sharing
I would like to get 5-hour rolling means for "AvgConc","TotDep","DryDep", and "WetDep" by each receptor.
Upvotes: 13
Views: 12529
Reputation: 16727
Cliff did not correct his answer despite my comment was made long time ago, therefore submitting proper usage of vectorized arguments in frollmean
DT[, paste0("r_", vars) := frollmean(.SD, n=w, align=align), by = g, .SDcols = vars]
and a benchmark
library(data.table)
set.seed(42)
n = 1e6
DT = data.table(V1 = rnorm(n), V2 = rlnorm(n), V3 = runif(n), g = rep(c("a","b"),n/2), key="g")
vars = c("V1","V2","V3")
library(zoo)
w = 48
align = "left"
roland = function(DT) DT[, paste0("r_", vars) := lapply(.SD, rollmean, k=w, na.pad=TRUE, align=align), by = g, .SDcols = vars]
cliff = function(DT) DT[, paste0("r_", vars) := lapply(.SD, frollmean, n=w, fill=NA, align=align), by = g, .SDcols = vars]
jan = function(DT) DT[, paste0("r_", vars) := frollmean(.SD, n=w, align=align), by = g, .SDcols = vars]
DT1 = copy(DT)
DT2 = copy(DT)
DT3 = copy(DT)
system.time(a1 <- roland(DT1))
# user system elapsed
# 0.776 0.079 0.774
system.time(a2 <- cliff(DT2))
# user system elapsed
# 0.102 0.000 0.029
system.time(a3 <- jan(DT3))
# user system elapsed
# 0.064 0.000 0.023
all.equal(DT1, DT2)
#[1] TRUE
all.equal(DT1, DT3)
#[1] TRUE
Most importantly, the more variables, the faster jan
than cliff
will be as they are computed in parallel.
This can be easily observed if we scale up dataset and remove [.data.table
overhead from the benchmark:
library(data.table)
set.seed(42)
n = 1e7
nth = 8
DT = as.data.table(replicate(nth, rnorm(n)))
setDTthreads(nth)
w = 48
align = "left"
DT2 = copy(DT)
DT3 = copy(DT)
system.time(a2 <- lapply(DT2, frollmean, n=w, align=align))
# user system elapsed
# 0.316 0.244 0.561
system.time(a3 <- frollmean(DT2, w, align=align))
# user system elapsed
# 1.064 0.357 0.183
Upvotes: 1
Reputation: 6222
Now, one can use the frollmean
function in the data.table
package for this.
library(data.table)
xy <- c("x", "y")
DT[, (xy):= lapply(.SD, frollmean, n = 3, fill = NA, align="center"),
by = g, .SDcols = xy]
Here, I am replacing the x and y columns by the rolling average.
# Data
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10),
g = c("a", "b"), key = "g")
Upvotes: 13
Reputation: 132969
From your description you want something like this, which is similar to one example that can be found in one of the data.table vignettes:
library(data.table)
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10), g = c("a", "b"), key = "g")
library(zoo)
DT[, paste0("ravg_", c("x", "y")) := lapply(.SD, rollmean, k = 3, na.pad = TRUE),
by = g, .SDcols = c("x", "y")]
Upvotes: 20