Duck

Reputation: 39595

How to speed up row operations over large dataset when applying specific function

I am working with a large data frame in R called Q (a similarly structured example is built in the code below). It has 250,000 rows and 1,000 columns. My goal is to fit a time series model to each row and extract the coefficients from each model; in my case I use the auto.arima function from the forecast package. I have tried two approaches, which I include next:

library(forecast)
set.seed(123)
# Example data with the same structure as my real Q (250,000 x 1,000)
Q <- as.data.frame(matrix(rnorm(250000 * 1000), nrow = 250000, ncol = 1000, byrow = TRUE))
#Approach 1
models <- apply(Q, 1, auto.arima)
#Extract coeffs
coeffs <- lapply(models, function(x) x$coef)

#Approach 2
#Create a list and save coeff using a loop
tlist <- list(0)
for(i in 1:dim(Q)[1])
{
  models <- apply(Q[i,], 1, auto.arima)
  coeffs <- as.data.frame(lapply(models, function(x) as.data.frame(t(x$coef))))
  tlist[[i]] <- coeffs
  gc()
}

In Approach 1, I used the apply() function to build a list holding the models, and then lapply() to extract the coefficients. The issue with this approach is that it ran for 60 hours and still did not finish.

Approach 2 is a classic loop that applies the function to each row and saves the results in a list. The situation was the same: 30 hours and it still did not finish.

In both cases the task never completed and my computer eventually crashed. I do not know how to solve this, because my solutions are clearly very slow. My machine has 8 GB of RAM and runs 64-bit Windows. I would like to make this row-wise operation faster. Ideally I would add the resulting coefficients directly to Q, but if that is not possible, a list with the results would be fine. Q is a data frame, but it could also be a data.table.
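
For reference, this is roughly how I was thinking of combining the coefficients afterwards. It is only a sketch: it assumes coeffs is a list of named coefficient vectors (one per row) and uses data.table::rbindlist with fill = TRUE, since different models can return different coefficient names.

library(data.table)

## Sketch, assuming `coeffs` is a list of named numeric vectors (one per row).
## Each vector becomes a one-row table; fill = TRUE pads coefficients that a
## given model does not estimate with NA. Models with zero coefficients
## (e.g. ARIMA(0,0,0) with no mean) produce empty entries and may need extra handling.
coef_dt <- rbindlist(
  lapply(coeffs, function(cf) as.data.table(as.list(cf))),
  fill = TRUE
)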

Is there any way to boost my code in order to obtain my results? Many thanks for your help.

Upvotes: 0

Views: 545

Answers (1)

Cole

Reputation: 11255

As @IanCampbell says in the comments, the auto.arima function is where most of the time is spent. I am on Windows with a 2-core machine and I always turn to future.apply for parallel tasks.

I used only a 250 x 100 matrix - there was no need for me to test for 60 hours :). With 2 cores, the time went from 20s to 14s.

library(forecast)
library(future.apply)

set.seed(123)
nr = 250L
nc = 100L

mat <- matrix(rnorm(nr * nc), nrow = nr, ncol = nc, byrow = TRUE)

system.time(models1a <- apply(mat, 1L, auto.arima))
##   user  system elapsed 
##  19.84    0.02   20.04 

plan("multiprocess") ## needed for future_apply to make use of multiple cores
system.time(models1b <- future_apply(mat, 1L, auto.arima))

##   user  system elapsed 
##   0.48    0.02   14.22 

## future_lapply not needed - this is fast
identical(lapply(models1a, '[[', "coef"), lapply(models1b, '[[', "coef"))
## [1] TRUE
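
One caveat: on newer versions of the future package, the "multiprocess" plan is deprecated; on Windows the explicit equivalent is multisession. A minimal sketch (the workers = 2L value is just an assumption for a 2-core machine):

plan(multisession, workers = 2L)  ## explicit plan for Windows; workers = 2L assumed for a 2-core machine
models1c <- future_apply(mat, 1L, auto.arima)
plan(sequential)                  ## shut down the parallel workers when done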

Upvotes: 2
