Parallel computing with foreach - Saving loop outputs to global list

Question

I would like to run a large loop with the foreach function.This means using the %dopar% operator.

I can't find any questions already answered to this problem exactly. If this is a duplicate though, please point me in the right direction and I'll close this question.

I have been having mixed success. It works for simple examples on my machine, as per the help documentation, however I cannot seem to get good results for my own work. My example is slightly more complicated, so the devil seems to be in the detail, as always! I have also read the 'white paper' provided by the package creators Revolution Analytics (you can get it here). I don't see how best to maybe use the .combine argument to apply results to my global output list. I would like to assign claculated value to one big list as opposed to using cbind or c

My example is pretty convoluted, but if I simplify it any further then any answers might not address my issue.

I will perform a kind of moving-linear model. So fit a model using lm() over 50 obersvations [1:50], predict the 51st observation [51], saving the results to a list. Then I will shift it all one observation further. So a lm over [2:51] and predict the 52nd observation [52]. I will use a total of 100 observations, so I can make a maximum of 50 predictions.

## ============================================ ##
##  Setup the backend for the foreach function  ##
## ============================================ ##

## doMC calls upon cores on demand, uses them and closes them
library(doMC)
registerDoMC(detectCores())     #detectCores() uses all cores

## for Windows users
#library(doParallel) --> for Windows users
#registerDoParallel(detectCores())

## ======================== ##
##  Create some dummy data  ##
## ======================== ##

## three columns, one hundred observations
my_data <- data.table(outcome = runif(100), V1 = 3*runif(100), V2 = sqrt(runif(100)))

## Have a look at the data if you like - using the DT package
library(DT)
datatable(my_data, options = list(pageLength = nrow(my_data)))

## ================================= ##
##  Perform the loop the normal way  ##
## ================================= ##

## Create container (a list of lists) for results
my_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(my_results)){my_results[[i]] <- step_results}

## Use a for loop to stpe through all the 50 'slices'
for(i in 1:50) {        #max. 50 predictions possible

    ## Fit a linear model
    my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])

    ## Predict the next step
    my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)]) 

    error <- my_data$outcome[i+50] - my_pred    #simply measure the delta to the actual value

    ## Assign some results to the container created earlier
    my_results[[i]][[1]] <- my_fit$coefficients
    my_results[[i]][[2]] <- my_fit$rank
    my_results[[i]][[3]] <- error

}
str(my_results)    ## Keep this container to compare to our next one

## ============================================ ##
##  Perform the loop using foreach and %dopar%  ##
## ============================================ ##

## Create same results object for results as previously for parallel results
par_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(par_results)){par_results[[i]] <- step_results}

my_results_par <- foreach(i = 1:50) %dopar%
    {        #max. 50 predictions possible

        my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])     
        my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)]) 
        error <- my_data$outcome[i+50] - my_pred 

        ## Assign some results to the container created earlier
        par_results[[i]][[1]] <- my_fit$coefficients
        par_results[[i]][[2]] <- my_fit$rank
        par_results[[i]][[3]] <- error

        Sys.sleep(i/20)    #Allows time to see R processes spawn on your system
        return(par_results)
    }

## We can see straight away that this didn't work as I would like it to
identical(my_results, my_results_par)   #FALSE

## This shows that the output seems good on the surface
class(my_results_par)
length(my_results_par)
## This shows that it doesn't (WARNING: very long)
str(my_results_par)

You can try out the various .combine arguments in the foreach function, for example:

foreach(i = 1:50, .combine = "c") {computation}

or

foreach(i = 1:50, .combine = "cbind") {computation}

these prodice a vector and a matrix respectively, but do not contain all the results that I was trying to save in each loop.

Questions

Does that structure give you a clue as to what is going on?
How might I use .combine argument to create my desired output?
Is what I am trying to do even possible??
Do I need to put the loop with foreach at a different point in the algorithm?

I have read that you can supply a custom function to foreach... might this be the way to do it? I still don't see how I would combine the results.

J.R. · Accepted Answer

Yes, this can easily be done. We can modify your code for the foreach-step to the following, where we export the data.table package to each worker.

my_results_par <- foreach(i = 1:50, .combine = append, .packages = c("data.table")) %dopar%
    {      
        my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])     
        my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)]) 
        error <- my_data$outcome[i+50] - my_pred 

        par_results <- list(
            coefs = my_fit$coefficients,
            rank = my_fit$rank,
            error = error
        )
        par_results <- list(par_results)
        names(par_results) <- paste0("step_", i)
        return(par_results)
    }
identical(my_results, my_results_par)   
[1] TRUE

Parallel computing with foreach - Saving loop outputs to global list

Answers (1)

Related Questions