Reputation: 49
I want to forecast some values of traffic (I will call it Y). To do that I have data.table
called X_Matrix
with a bunch of variables (dimensions are 30000 by 34). X_Matrix
has some lagged variables (Y(-1), Y(-2), Y(-3), Y(-4)). I use the predicted Y values to forecast other Y values in the future. I also add a error term to my prediction. My code looks like this:
numberOfObservations <- 1000
for (i in 5:numberOfObservations){
error <- rnorm(1,0,standarderror)
#calculating the value
artificialY <- max(sum(X_matrix[i,]*parameters)+error,0)
Y_matrix[i] <- artificialY
#adjusting the lagged variables for calculations in the future
X_matrix$lagged1[i+1] <- log1p(artificialY)
X_matrix$lagged2[i+2] <- log1p(artificialY)
X_matrix$lagged3[i+3] <- log1p(artificialY)
X_matrix$lagged4[i+4] <- log1p(artificialY)
}
This code works fine, but it take forever to calculate all these values.....
Is there any way to speed this proces up?
Although I think it is not necessary for this type of question, I can provide a portion of the data if you want that!
Thanks in advance,
RC
Upvotes: 0
Views: 100
Reputation: 1231
I'll note that it's a good idea to provide some data even in cases like this as it saves people from needing to simulate a lot of data in the case that they want to give a reproducible example (like I prefer to do). The main thing to do here is to focus on vectorization instead of using a for
loop. My main changes are as follows:
log1p
onceThe three together make an improvement of a few orders of magnitude. Hope this helps!
your_way <- function() {
parameters = rnorm(5)
numberOfObservations <- 1000
X_matrix = data.frame(matrix(rnorm(numberOfObservations * 5), ncol = 5))
standarderror = .2
Y_matrix = numeric(numberOfObservations)
X_matrix$lagged1 = NA
X_matrix$lagged2 = NA
X_matrix$lagged3 = NA
X_matrix$lagged4 = NA
for (i in 1:numberOfObservations){
error <- rnorm(1,0,standarderror)
artificialY <- max(sum(X_matrix[i,1:5]*parameters)+error,0)
Y_matrix[i] <- artificialY
if(i < numberOfObservations) X_matrix$lagged1[i+1] <- log1p(artificialY)
if(i < numberOfObservations - 1) X_matrix$lagged2[i+2] <- log1p(artificialY)
if(i < numberOfObservations - 2) X_matrix$lagged3[i+3] <- log1p(artificialY)
if(i < numberOfObservations - 3) X_matrix$lagged4[i+4] <- log1p(artificialY)
}
return(list(X_matrix, Y_matrix))
}
faster_way <- function() {
parameters = rnorm(5)
numberOfObservations <- 1000
X_matrix <- data.frame(matrix(rnorm(numberOfObservations * 5), ncol = 5))
n_sim <- numberOfObservations - 5
standarderror <- .2
Y_matrix <- as.vector(as.matrix(X_matrix) %*% parameters) +
rnorm(numberOfObservations, 0, standarderror)
Y_matrix[Y_matrix < 0] <- 0
lp1_Y = log1p(Y_matrix)
X_matrix$lagged1 = c(NA, lp1_Y[1:(numberOfObservations-1)])
X_matrix$lagged2 = c(NA, NA, lp1_Y[1:(numberOfObservations-2)])
X_matrix$lagged3 = c(NA, NA, NA, lp1_Y[1:(numberOfObservations-3)])
X_matrix$lagged4 = c(NA, NA, NA, NA, lp1_Y[1:(numberOfObservations-4)])
return(list(X_matrix, Y_matrix))
}
set.seed(123)
res1 <- your_way()
set.seed(123)
res2 <- faster_way()
all.equal(res1, res2)
#> [1] TRUE
library(microbenchmark)
microbenchmark(your_way(), faster_way())
#> Unit: microseconds
#> expr min lq mean median uq
#> your_way() 529118.816 560869.7300 586385.3679 573959.507 592229.9270
#> faster_way() 591.804 641.3835 693.5161 669.631 703.8035
#> max neval
#> 831027.359 100
#> 2292.149 100
Created on 2019-10-09 by the reprex package (v0.3.0)
Upvotes: 3