Vicki1227
Vicki1227

Reputation: 151

rolling average to multiple variables in R using data.table package

I would like to get rolling average for each of the numeric variables that I have. Using data.table package, I know how to compute for a single variable. But how should I revise the code so it can process multiple variables at a time rather than revising the variable name and repeat this procedure for several times? Thanks.

Suppose I have other numeric variables named as "V2", "V3", and "V4".

require(data.table)
setDT(data)
setkey(data,Receptor,date)
data[ , `:=` ('RollConc' = rollmean(AvgConc, 48, align="left", na.pad=TRUE)) , by=Receptor]

A copy of my sample data can be found at: https://drive.google.com/file/d/0B86_a8ltyoL3OE9KTUstYmRRbFk/view?usp=sharing

I would like to get 5-hour rolling means for "AvgConc","TotDep","DryDep", and "WetDep" by each receptor.

Upvotes: 13

Views: 12529

Answers (3)

jangorecki
jangorecki

Reputation: 16727

Cliff did not correct his answer despite my comment was made long time ago, therefore submitting proper usage of vectorized arguments in frollmean

DT[, paste0("r_", vars) := frollmean(.SD, n=w, align=align), by = g, .SDcols = vars]

and a benchmark

library(data.table)
set.seed(42)
n = 1e6
DT = data.table(V1 = rnorm(n), V2 = rlnorm(n), V3 = runif(n), g = rep(c("a","b"),n/2), key="g")
vars = c("V1","V2","V3")
library(zoo)
w = 48
align = "left"

roland = function(DT) DT[, paste0("r_", vars) := lapply(.SD, rollmean, k=w, na.pad=TRUE, align=align), by = g, .SDcols = vars]
cliff = function(DT) DT[, paste0("r_", vars) := lapply(.SD, frollmean, n=w, fill=NA, align=align), by = g, .SDcols = vars]                                   
jan = function(DT) DT[, paste0("r_", vars) := frollmean(.SD, n=w, align=align), by = g, .SDcols = vars]

DT1 = copy(DT)
DT2 = copy(DT)
DT3 = copy(DT)
system.time(a1 <- roland(DT1))
#   user  system elapsed 
#  0.776   0.079   0.774
system.time(a2 <- cliff(DT2))
#   user  system elapsed 
#  0.102   0.000   0.029 
system.time(a3 <- jan(DT3))
#   user  system elapsed 
#  0.064   0.000   0.023

all.equal(DT1, DT2)
#[1] TRUE
all.equal(DT1, DT3)
#[1] TRUE

Most importantly, the more variables, the faster jan than cliff will be as they are computed in parallel.


This can be easily observed if we scale up dataset and remove [.data.table overhead from the benchmark:

library(data.table)
set.seed(42)
n = 1e7
nth = 8
DT = as.data.table(replicate(nth, rnorm(n)))
setDTthreads(nth)
w = 48
align = "left"
DT2 = copy(DT)
DT3 = copy(DT)

system.time(a2 <- lapply(DT2, frollmean, n=w, align=align))
#   user  system elapsed 
#  0.316   0.244   0.561 
system.time(a3 <- frollmean(DT2, w, align=align))
#   user  system elapsed 
#  1.064   0.357   0.183 

Upvotes: 1

kangaroo_cliff
kangaroo_cliff

Reputation: 6222

Now, one can use the frollmean function in the data.table package for this.

library(data.table)    
xy <- c("x", "y")
DT[, (xy):= lapply(.SD, frollmean, n = 3, fill = NA, align="center"), 
                                   by = g, .SDcols =  xy]

Here, I am replacing the x and y columns by the rolling average.


# Data
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10), 
                                g = c("a", "b"), key = "g")

Upvotes: 13

Roland
Roland

Reputation: 132969

From your description you want something like this, which is similar to one example that can be found in one of the data.table vignettes:

library(data.table)
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10), g = c("a", "b"), key = "g")
library(zoo)
DT[, paste0("ravg_", c("x", "y")) := lapply(.SD, rollmean, k = 3, na.pad = TRUE), 
   by = g, .SDcols = c("x", "y")]

Upvotes: 20

Related Questions