Reputation: 39595
I am working with a big dataframe in R. My dataframe is Q, a similarly structured example of which I include in my code below. It has 250,000 rows and 1,000 columns. My goal is to fit a time series model to each row in order to obtain the coefficients from each model. In my case, I will use the auto.arima function from the forecast package. I have tried two ways to solve my problem, which I include next:
library(forecast)
set.seed(123)
Q <- as.data.frame(matrix(rnorm(250000 * 1000), nrow = 250000, ncol = 1000, byrow = TRUE))
#Approach 1
models <- apply(Q, 1, auto.arima)
#Extract coeffs
coeffs <- lapply(models, function(x) x$coef)
#Approach 2
#Create a list and save coeff using a loop
tlist <- list(0)
for (i in 1:dim(Q)[1]) {
  models <- apply(Q[i, ], 1, auto.arima)
  coeffs <- as.data.frame(lapply(models, function(x) as.data.frame(t(x$coef))))
  tlist[[i]] <- coeffs
  gc()
}
In Approach 1, I used the apply() function to build a list in which to save the models, and then used lapply() to extract the coefficients. The issue with this approach is that it ran for 60 hours and still did not finish.
Approach 2 is a classic loop that applies the function to each row and then saves the results in a list. The situation was the same: 30 hours and it did not finish.
In both cases the task was never completed, and my computer ended up crashing. I do not know how to solve this time issue, because it looks like my solutions are very slow. My computer has 8 GB of RAM and a 64-bit Windows system. I would like to make this row-wise operation faster. It would be great if I could add the coefficient results directly to Q, but if that is not possible, a list with the results would be fantastic. Q is a data.frame, but it could also be a data.table.
Is there any way to speed up my code in order to obtain my results? Many thanks for your help.
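To illustrate, this is roughly the shape of the result I am after; the coefficient names and values below are made up, since different rows can yield models with different terms:
## Hypothetical desired output: one row of coefficients per row of Q,
## with NA where a model lacks that term (values are invented)
coeffs_wanted <- data.frame(ar1       = c(0.52, NA),
                            ma1       = c(NA, 0.35),
                            intercept = c(0.013, -0.020))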
Upvotes: 0
Views: 545
Reputation: 11255
As @IanCampbell says in the comments, the auto.arima function is where most of the time is spent. I am on Windows with a 2-core machine, and I always turn to future.apply for parallel tasks.
I used only a 250 x 100 matrix - there was no need for me to test for 60 hours :). With 2 cores, the time went from about 20 s to 14 s.
library(forecast)
library(future.apply)
set.seed(123)
nr = 250L
nc = 100L
mat <- matrix(rnorm(nr * nc), nrow = nr, ncol = nc, byrow = TRUE)
system.time(models1a <- apply(mat, 1L, auto.arima))
## user system elapsed
## 19.84 0.02 20.04
plan("multiprocess") ## needed for future_apply to make use of multiple cores
system.time(models1b <- future_apply(mat, 1L, auto.arima))
## user system elapsed
## 0.48 0.02 14.22
## future_lapply not needed - this is fast
identical(lapply(models1a, '[[', "coef"), lapply(models1b, '[[', "coef"))
## [1] TRUE
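If you then want the coefficients in a single table rather than a list, a sketch along these lines should work (this part is my own addition, not from the original code; rbindlist from data.table fills in NA where a model lacks a given term):
library(data.table)
## Collect each model's coefficients as a named list, keeping an explicit
## row id so alignment with the input rows survives even for models with
## no coefficients at all (auto.arima can return ARIMA(0,0,0) with zero mean)
coef_list <- lapply(seq_along(models1b), function(i) {
  c(list(row = i), as.list(models1b[[i]]$coef))
})
coef_dt <- rbindlist(coef_list, fill = TRUE)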
Upvotes: 2