Fixing the seed for parallel simulation runs with different numbers of cores

Question

I'd like to parallelize a simulation study to speed it up while keeping it reproducible. In particular, I'd like to obtain the same result as if I had called set.seed at the beginning of a sequential simulation run. Here is an example of how I set it up (I purposely use .inorder=T here):

library(doSNOW)
library(rlecuyer)

nr.cores = 4
nr.simulations = 10 
sample.size = 100000

seed = 12345

cl = makeCluster(nr.cores)
registerDoSNOW(cl)
clusterExport(cl=cl, list=c('sample.size'), envir=environment())
clusterSetupRNGstream(cl,rep(seed,6))

result = foreach(i=1:nr.simulations, .combine = 'c', .inorder=T)%dopar%{
  tmp = rnorm(sample.size)
  tmp[sample.size]
}

stopCluster(cl)

print(paste0('nr.cores = ',nr.cores,'; seed = ',seed,'; time =',Sys.time()))
print(result)

I have two questions after running this example several times:

1. The number of cores affects the resulting sequence: for nr.cores=1 and nr.cores=4 only the first values coincide, and for nr.cores=4 and nr.cores=8 only the first four values coincide. Is there a way to make the result independent of nr.cores? Conceptually, I imagine I could create an RNG stream of size nr.simulations * sample.size, split it into nr.simulations pieces, and always distribute them to the nodes in the same order. Even simpler, I could fix nr.simulations different seed values and again pass them to the nodes in a fixed order. This could be done with some kind of mapping that each node uses to look up its seed value in a table. Is there a way to do this?

2. When I run the script several times, the resulting sequence is sometimes (not always) reordered, even though I do not change any of the parameters (I just source the file again and again). This looks like a bug to me, as either .inorder or clusterSetupRNGstream seems to fail. Or am I missing something?

[1] "nr.cores = 4; seed = 12345; time =2017-09-08 19:00:24"
[1]  1.327091137 -1.800244293 -1.163391460  0.005980001  0.957521136  1.641354433 -1.219033091
[8] -0.238129356 -0.225193384  1.457018576

[1] "nr.cores = 4; seed = 12345; time =2017-09-08 19:00:28"
[1]  1.327091137 -1.800244293 -1.163391460  0.005980001 -0.238129356  0.957521136  1.641354433
[8] -1.219033091  0.870269174 -0.225193384

CPak · Accepted Answer

1st Q: The following seemed to work for me

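library(parallel)
library(doParallel)

# Start 5 workers and register them as the foreach backend
cl <- makeCluster(5)
registerDoParallel(cl)

# One fixed seed per iteration (not per worker), exported to all workers
seedlist <- c(100, 200, 300, 400, 500)
clusterExport(cl, 'seedlist')
# Seed inside each task, so the result depends only on the iteration index
foreach(I=1:5) %dopar% {set.seed(seedlist[I]); runif(1)}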

[[1]]
[1] 0.3077661

[[2]]
[1] 0.5337724

[[3]]
[1] 0.9152467

[[4]]
[1] 0.1499731

[[5]]
[1] 0.8336


# Sequential check: the same seed reproduces element [[1]]
set.seed(100)
runif(1)
[1] 0.3077661
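
The per-iteration seeding above can be generalized with the L'Ecuyer-CMRG streams provided by base R's parallel package. What follows is only a sketch under stated assumptions: the helper names run_task and run_with_cores are hypothetical, one stream is created per task rather than per worker, and the values will not match a plain sequential set.seed(12345) run with the default generator. The sketch only aims for results that agree across different numbers of workers.

```r
library(parallel)

nr.simulations <- 10
sample.size <- 1000

# Build one L'Ecuyer-CMRG stream seed per task (not per worker).
RNGkind("L'Ecuyer-CMRG")
set.seed(12345)
seeds <- vector("list", nr.simulations)
s <- .Random.seed
for (i in seq_len(nr.simulations)) {
  seeds[[i]] <- s
  s <- nextRNGStream(s)  # jump ahead to the next independent stream
}

# Hypothetical helper: install the task's stream, then simulate.
run_task <- function(seed, n) {
  RNGkind("L'Ecuyer-CMRG")
  assign(".Random.seed", seed, envir = .GlobalEnv)
  tmp <- rnorm(n)
  tmp[n]
}

# Hypothetical helper: run all tasks on a cluster of the given size.
run_with_cores <- function(nr.cores) {
  cl <- makeCluster(nr.cores)
  on.exit(stopCluster(cl))
  unlist(parLapply(cl, seeds, run_task, n = sample.size))
}

res_seq <- vapply(seeds, run_task, numeric(1), n = sample.size)
res2 <- run_with_cores(2)
res3 <- run_with_cores(3)

# Both comparisons should print TRUE if the sketch behaves as intended.
print(identical(res_seq, res2))
print(identical(res2, res3))
```

Because each task carries its own stream, it does not matter which worker ends up running it, so the same idea should carry over to a foreach loop that indexes into seeds.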

2nd Q: Seems like a bug but maybe someone else has a better clue
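
One possible explanation for the second question, stated as an assumption rather than a confirmed diagnosis: clusterSetupRNGstream gives each worker its own stream, while a foreach backend may hand tasks beyond the first nr.cores to whichever worker becomes free first. .inorder only controls the order in which results are combined, not which worker computes which task, so the value for a given task could change between runs. The sketch below illustrates the effect with base R's parallel package instead of doSNOW (a swapped-in backend, used here because clusterApply assigns tasks to workers in a fixed round-robin order, whereas clusterApplyLB assigns them as workers become free). The helper name run_static is hypothetical.

```r
library(parallel)

cl <- makeCluster(2)

# Hypothetical helper: reset the per-worker streams, then use static
# (round-robin) scheduling so task i always lands on the same worker.
run_static <- function() {
  clusterSetRNGStream(cl, iseed = 12345)
  unlist(clusterApply(cl, 1:6, function(i) rnorm(1)))
}

a <- run_static()
b <- run_static()
print(identical(a, b))  # should print TRUE under static scheduling

# Load-balanced scheduling: record which worker process ran each task.
# The task-to-worker mapping may differ from run to run.
pids <- unlist(clusterApplyLB(cl, 1:6, function(i) {
  Sys.sleep(runif(1, 0, 0.05))  # uneven task durations
  Sys.getpid()
}))
print(pids)

stopCluster(cl)
```

If the mapping printed for the load-balanced call changes between runs, per-worker streams cannot give reproducible per-task values, which would be consistent with the reordered output in the question.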
