Empiromancer

Reputation: 3854

Assignment inside parallelized foreach loop

I've been attempting to run a large parallel operation, but learned to my chagrin that I can't make assignments that stick inside a parallelized foreach loop. That is, running the following code results in no change to p:

p <- numeric(3)
foreach(i=1:3) %dopar% {
  p[i] <- 1
}
p
# [1] 0 0 0

I thought it might be an environment issue (i.e., the assignment to p is local), but changing <- to <<- only gave me an error: Error in { : task 1 failed - "object 'p' not found"

Is there some way to either make the subassignment work or work around this problem?

In my real case, p[i] <- 1 is actually a subassignment of many elements at once, at random (but predetermined prior to the loop) places in the vector, so taking advantage of something like .combine = c is sadly out of the question.

What I've tried so far:

I tried working around this by using .combine = `+`, like so:

s <- foreach(i=1:3, .combine = `+`) %dopar% {
  p <- numeric(3)
  p[i] <- 1
  p
}

While this worked for my small test cases, when I applied it to my full-size case I got an error (after it ran for about 6 hours, mind you) that R couldn't allocate a vector of size 6.1 GB. Note that this is much larger than the several-hundred-MB vectors each iteration is meant to produce, which I suppose means some hidden concatenation took place.

Particulars of my case

My problem involves performing a k-fold cross validation, which means each row of data is assigned a fold 1 to K, and the foreach loop is looping through the folds k = 1:K, fitting a model on data with folds != k, and then using that model to predict on the remaining data (folds == k). So, ignoring for a moment that this code won't work, I'd like to do something like

folds <- sample(1:K, nrow(mydata), replace = TRUE)
preds <- numeric(nrow(mydata))
foreach(k=1:K) %do% {
  m <- fit_model(...)                    # Pseudocode
  preds[folds == k] <- predict_on_model(...) # Pseudocode
}

Thus, my challenge is to get the output of the foreach loop in the correct order.

Upvotes: 4

Views: 4076

Answers (1)

Steve Weston

Reputation: 19667

Many people get confused when they first notice that you can't modify variables outside of a parallel loop using foreach. You could solve your problem by using a "combine" function that performs the appropriate assignments. For example:

library(doSNOW)
cl <- makeSOCKcluster(4)
registerDoSNOW(cl)
K <- 10
N <- 100
set.seed(4325)
folds <- sample(1:K, N, replace=TRUE)

comb <- function(p, ...) {
  for (r in list(...)) {
    p[folds == r$k] <- r$p
  }
  p
}

preds <-
  foreach(k=1:K, .combine='comb', .init=numeric(N),
          .multicombine=TRUE) %dopar% {
    p <- 100 + k  # replace this
    list(k=k, p=p)  # include data needed by the combine function
  }

The foreach loop performs the parallel computations and the "combine" function performs the assignments. Notice the use of the foreach .init argument to specify the initial value of the preds vector. The predictions will be accumulated in this vector every time the combine function is called.
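To see what the combine function does with the results, it can help to call it by hand, outside of foreach. The following is only a sketch: the tiny folds vector and the 101/102/103 values are made up for illustration, and the two manual calls merely imitate how foreach might hand results to comb in batches (first with the .init value, then with the accumulated vector).

```r
# Hypothetical toy data: 6 rows assigned to 3 folds
folds <- c(1, 3, 2, 1, 2, 3)
N <- length(folds)

# Same combine function as above: writes each task's result
# into the positions belonging to its fold
comb <- function(p, ...) {
  for (r in list(...)) {
    p[folds == r$k] <- r$p
  }
  p
}

# First call: the accumulator starts as the .init value
acc <- comb(numeric(N), list(k = 1, p = 101), list(k = 2, p = 102))
acc
# [1] 101   0 102 101 102   0

# Later call: the previous return value is passed back in as p
acc <- comb(acc, list(k = 3, p = 103))
acc
# [1] 101 103 102 101 102 103
```

Because the accumulator is only ever a single vector of length N, nothing longer than that is built by the combine step itself.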

Another solution is to reorder the results using a "final" function that uses the folds vector:

reorder <- function(p) p[folds]
preds <-
  foreach(k=1:K, .combine='c', .final=reorder) %dopar% {
    100 + k  # replace this
  }

Although it's a less general technique, I suspect this will be more efficient.
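Note that p[folds] works here because each task returns a single value per fold. If each task instead returns one prediction per held-out row, as in the question's cross-validation, a variant along the following lines might do the reordering. This is a sketch under assumptions: reorder_rows and idx are names introduced here for illustration, the "predictions" are stand-in values, and registerDoSEQ() is used only so the example runs without a cluster.

```r
library(foreach)
registerDoSEQ()  # sequential backend, just so the sketch runs anywhere

K <- 4
N <- 12
set.seed(1)
folds <- sample(1:K, N, replace = TRUE)

# Row indices grouped by fold, i.e. the order in which .combine = 'c'
# concatenates the per-fold results when they arrive in order
idx <- unlist(lapply(1:K, function(k) which(folds == k)))

# Hypothetical "final" function: put each value back at its original row
reorder_rows <- function(p) {
  out <- numeric(length(p))
  out[idx] <- p
  out
}

preds <-
  foreach(k = 1:K, .combine = 'c', .final = reorder_rows) %dopar% {
    rows <- which(folds == k)
    rows + k / 10  # stand-in for per-row predictions; replace this
  }
```

With these stand-in values, preds[i] should come out as i + folds[i] / 10, which makes it easy to check that every prediction landed on the row it belongs to.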

Upvotes: 4
