Reputation: 757
I'm trying to put together a function which will replicate the following
library(tidyverse)
library(magrittr)
library(data.table)
library(parallel)
library(RcppRoll)
windows <- (1:10)*600
df2 <- setDT(df_1, key=c("Match","Name"))[
,by=.(Match, Name), paste0("Period_", 1:10)
:= mclapply((1:10)*600, function(x) roll_mean(Dist, x))][]
It creates a rolling average based off the values assigned to windows
I have a working function which replicates it however, I have a feeling there's a better way of doing it as the function version takes almost 30 times longer to process the data
dt_rolling <- function(df, the.keys, x, y, z, window){
df <- data.table(df)
setkeyv(df, the.keys)
df[,by=.(x,y), paste0("Period_", window) := mclapply(window, function(a) roll_mean(z, a))][]
}
df2 <- dt_rolling(df_1, the.keys=c('Match', 'Name'), df_1$Match, df_1$Name, df_1$Dist, windows)
The data in question looks like this
> dput(head(df_1, 5))
structure(list(Match = c("BathH", "BathH", "BathH", "BathH",
"BathH"), Name = c("Alafoti Faosiliva", "Alafoti Faosiliva",
"Alafoti Faosiliva", "Alafoti Faosiliva", "Alafoti Faosiliva"
), Dist = c(0, 0, 0, 0, 0), Period_1 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_2 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_3 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_4 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_5 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_6 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_7 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_8 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_9 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_10 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_600 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_1200 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_1800 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_2400 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_3000 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_3600 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_4200 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_4800 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_5400 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), Period_6000 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_)), sorted = c("Match", "Name"), class = c("data.table",
"data.frame"), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x10280cae0>)
It can extend to over 20 million rows so that's why I'm using a data.table
approach here along with investigating changing it to a function
Upvotes: 1
Views: 1333
Reputation: 16697
Fast rolling mean is available in data.table since v1.12.0 version.
Following query will address your question.
df_1[, paste0("Period_", windows) := frollmean(Dist, windows)]
adding some benchmark, unfortunately doesn't work for RcppRoll. Changed from mclapply to lapply for easier seeing the error.
library(data.table)
library(parallel)
library(RcppRoll)
windows = (1:10)*600
N = 1.5e6
set.seed(108)
dt = data.table(Match=letters[sample(26, N, TRUE)], Name=letters[sample(26, N, TRUE)], Dist=rnorm(N), key=c("Match","Name"))
dt1 = copy(dt)
dt2 = copy(dt)
system.time(
a1 <- dt1[, paste0("Period_", windows) := lapply(windows, function(w) roll_mean(Dist, w)), by=.(Match,Name)]
)
#Error in roll_mean_impl(x, as.integer(n), as.numeric(weights), as.integer(by), :
#negative length vectors are not allowed
system.time(
a2 <- dt2[, paste0("Period_", windows) := frollmean(Dist, windows),by=.(Match,Name)]
)
# user system elapsed
# 0.554 0.040 0.209
Upvotes: 7