r c
r c

Reputation: 49

Speeding up a forecast

I want to forecast some values of traffic (I will call it Y). To do that I have data.table called X_Matrix with a bunch of variables (dimensions are 30000 by 34). X_Matrix has some lagged variables (Y(-1), Y(-2), Y(-3), Y(-4)). I use the predicted Y values to forecast other Y values in the future. I also add a error term to my prediction. My code looks like this:

numberOfObservations <- 1000
for (i in 5:numberOfObservations){
     error <- rnorm(1,0,standarderror)

     #calculating the value
     artificialY <- max(sum(X_matrix[i,]*parameters)+error,0)
     Y_matrix[i] <- artificialY

     #adjusting the lagged variables for calculations in the future
     X_matrix$lagged1[i+1] <- log1p(artificialY) 
     X_matrix$lagged2[i+2] <- log1p(artificialY)
     X_matrix$lagged3[i+3] <- log1p(artificialY)
     X_matrix$lagged4[i+4] <- log1p(artificialY)
}

This code works fine, but it take forever to calculate all these values.....

Is there any way to speed this proces up?

Although I think it is not necessary for this type of question, I can provide a portion of the data if you want that!

Thanks in advance,

RC

Upvotes: 0

Views: 100

Answers (1)

ClancyStats
ClancyStats

Reputation: 1231

I'll note that it's a good idea to provide some data even in cases like this as it saves people from needing to simulate a lot of data in the case that they want to give a reproducible example (like I prefer to do). The main thing to do here is to focus on vectorization instead of using a for loop. My main changes are as follows:

  • Use matrix multiplication for getting the Y values
  • Assign the lagged variables all at once
  • Only calculate log1p once

The three together make an improvement of a few orders of magnitude. Hope this helps!

your_way <- function() {
  parameters = rnorm(5)
  numberOfObservations <- 1000
  X_matrix = data.frame(matrix(rnorm(numberOfObservations * 5), ncol = 5))
  standarderror = .2
  Y_matrix = numeric(numberOfObservations)
  X_matrix$lagged1 = NA
  X_matrix$lagged2 = NA
  X_matrix$lagged3 = NA
  X_matrix$lagged4 = NA

  for (i in 1:numberOfObservations){
    error <- rnorm(1,0,standarderror)
    artificialY <- max(sum(X_matrix[i,1:5]*parameters)+error,0)
    Y_matrix[i] <- artificialY
    if(i < numberOfObservations) X_matrix$lagged1[i+1] <- log1p(artificialY)
    if(i < numberOfObservations - 1) X_matrix$lagged2[i+2] <- log1p(artificialY)
    if(i < numberOfObservations - 2) X_matrix$lagged3[i+3] <- log1p(artificialY)
    if(i < numberOfObservations - 3) X_matrix$lagged4[i+4] <- log1p(artificialY)
  }
  return(list(X_matrix, Y_matrix))
}

faster_way <- function() {
  parameters = rnorm(5)
  numberOfObservations <- 1000
  X_matrix <- data.frame(matrix(rnorm(numberOfObservations * 5), ncol = 5))
  n_sim <- numberOfObservations - 5
  standarderror <- .2
  Y_matrix <- as.vector(as.matrix(X_matrix) %*% parameters) + 
    rnorm(numberOfObservations, 0, standarderror)
  Y_matrix[Y_matrix < 0] <- 0
  lp1_Y = log1p(Y_matrix)
  X_matrix$lagged1 = c(NA, lp1_Y[1:(numberOfObservations-1)])
  X_matrix$lagged2 = c(NA, NA, lp1_Y[1:(numberOfObservations-2)])
  X_matrix$lagged3 = c(NA, NA, NA, lp1_Y[1:(numberOfObservations-3)])
  X_matrix$lagged4 = c(NA, NA, NA, NA, lp1_Y[1:(numberOfObservations-4)])

  return(list(X_matrix, Y_matrix))
}
set.seed(123)
res1 <- your_way()
set.seed(123)
res2 <- faster_way()
all.equal(res1, res2)
#> [1] TRUE

library(microbenchmark)
microbenchmark(your_way(), faster_way())
#> Unit: microseconds
#>          expr        min          lq        mean     median          uq
#>    your_way() 529118.816 560869.7300 586385.3679 573959.507 592229.9270
#>  faster_way()    591.804    641.3835    693.5161    669.631    703.8035
#>         max neval
#>  831027.359   100
#>    2292.149   100

Created on 2019-10-09 by the reprex package (v0.3.0)

Upvotes: 3

Related Questions