I'd like to parallelize a simulation study to speed it up, and I'd also like it to be reproducible. In particular, I'd like to obtain the same result as if I had called set.seed at the beginning of a sequential simulation run.
Here is an example of how I try to set it up (I purposefully use .inorder=T here):
library(doSNOW)
library(rlecuyer)
nr.cores = 4
nr.simulations = 10
sample.size = 100000
seed = 12345
cl = makeCluster(nr.cores)
registerDoSNOW(cl)
clusterExport(cl=cl, list=c('sample.size'), envir=environment())
clusterSetupRNGstream(cl,rep(seed,6))
result = foreach(i=1:nr.simulations, .combine='c', .inorder=T) %dopar% {
  tmp = rnorm(sample.size)
  tmp[sample.size]
}
stopCluster(cl)
print(paste0('nr.cores = ',nr.cores,'; seed = ',seed,'; time =',Sys.time()))
print(result)
There are two questions that I have after running this example several times:
1. The number of cores impacts the resulting sequence, e.g., for nr.cores=1 and 4 only the first values coincide, and for nr.cores=4 and 8 the first four values coincide. Is there a way to make the result independent of nr.cores? Conceptually, I'd imagine I could create an RNG stream of size nr.simulations * sample.size, split it into nr.simulations pieces, and distribute them to the nodes always in the same order. Even simpler, I could fix nr.simulations (different) seed values and again pass them to the nodes in a fixed order. This could be done with some kind of node mapping that each node uses to read its seed value from a table. Is there a way to do it?
2. When I run the script several times, it happens (not always, but from time to time) that the resulting sequence is reordered even though I do not change any of the parameters (I just source the file again and again). This looks like a bug to me, as either .inorder or clusterSetupRNGstream fails. Or am I missing something?
[1] "nr.cores = 4; seed = 12345; time =2017-09-08 19:00:24"
[1] 1.327091137 -1.800244293 -1.163391460 0.005980001 0.957521136 1.641354433 -1.219033091
[8] -0.238129356 -0.225193384 1.457018576
[1] "nr.cores = 4; seed = 12345; time =2017-09-08 19:00:28"
[1] 1.327091137 -1.800244293 -1.163391460 0.005980001 -0.238129356 0.957521136 1.641354433
[8] -1.219033091 0.870269174 -0.225193384
Upvotes: 3
Views: 1128
Reputation: 11738
To show that the .inorder parameter only says that the results are returned in order, not that they are computed in order, you can run:
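library(doSNOW)
nr.cores = 4  # same value as in the question
cl = makeCluster(nr.cores)
registerDoSNOW(cl)
replicate(10, {
  # each task returns the PID of the worker that ran it
  foreach(ic = 1:8, .combine = 'c', .inorder = TRUE) %dopar% {
    Sys.sleep(runif(1))
    Sys.getpid()
  }
})
stopCluster(cl)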
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 9252 9252 9252 9252 9252 9252 9252 9252 9252 9252
[2,] 9259 9259 9259 9259 9259 9259 9259 9259 9259 9259
[3,] 9266 9266 9266 9266 9266 9266 9266 9266 9266 9266
[4,] 9273 9273 9273 9273 9273 9273 9273 9273 9273 9273
[5,] 9273 9252 9259 9266 9273 9266 9252 9252 9266 9252
[6,] 9266 9266 9273 9273 9273 9259 9266 9259 9273 9266
[7,] 9266 9259 9252 9259 9259 9273 9273 9252 9259 9266
[8,] 9252 9252 9259 9266 9252 9252 9266 9252 9273 9259
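Hmm, I'm not sure this really proves my point. It just shows that the workers take the first tasks in order, and that after that, whichever worker finishes first picks up the next task. Since clusterSetupRNGstream gives each worker its own stream, the value a later task gets depends on which worker happens to run it, which would explain the reordering you see.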
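To make the effect on the random numbers concrete, here is a small sequential sketch (no cluster involved) of what I assume happens with per-worker streams: each simulated worker owns one stream, so the value a task gets depends on which worker runs it. The helper simulate and the two schedules are made up for illustration, and I use nextRNGStream from the parallel package instead of rlecuyer.

```r
library(parallel)  # for nextRNGStream()

# Two simulated "workers", each with its own L'Ecuyer-CMRG stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(12345)
worker.seed <- list(.Random.seed, nextRNGStream(.Random.seed))

# Hypothetical helper: schedule[k] is the worker that runs task k.
simulate <- function(schedule) {
  seeds <- worker.seed
  sapply(schedule, function(w) {
    # switch to worker w's stream, draw, then save the advanced state
    assign(".Random.seed", seeds[[w]], envir = globalenv())
    x <- rnorm(1)
    seeds[[w]] <<- get(".Random.seed", envir = globalenv())
    x
  })
}

a <- simulate(c(1, 2, 1, 2))  # worker 1 happens to take task 3
b <- simulate(c(1, 2, 2, 1))  # worker 2 happens to take task 3
print(rbind(a, b))
```

The first two values agree in both runs, while the last two are swapped, which is the same kind of reordering as in the question's output.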
As suggested by @Roland, you could use the package doRNG to do what you want. Let us verify:
library(doSNOW)
library(doRNG)
nr.cores = 4  # same value as in the question
cl = makeCluster(nr.cores)
registerDoSNOW(cl)
replicate(14, {
  set.seed(12345)  # same seed before every run
  sample.size = 100000
  foreach(ic = 1:8, .combine = 'c', .inorder = TRUE) %dorng% {
    Sys.sleep(runif(1))
    tmp = rnorm(sample.size)
    tmp[sample.size]
  }
})
stopCluster(cl)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.42281264 0.42281264 0.42281264 0.42281264 0.42281264 0.42281264 0.42281264
[2,] -1.67678339 -1.67678339 -1.67678339 -1.67678339 -1.67678339 -1.67678339 -1.67678339
[3,] -0.49011636 -0.49011636 -0.49011636 -0.49011636 -0.49011636 -0.49011636 -0.49011636
[4,] -0.87165416 -0.87165416 -0.87165416 -0.87165416 -0.87165416 -0.87165416 -0.87165416
[5,] -1.02636022 -1.02636022 -1.02636022 -1.02636022 -1.02636022 -1.02636022 -1.02636022
[6,] 0.56549835 0.56549835 0.56549835 0.56549835 0.56549835 0.56549835 0.56549835
[7,] 0.03998101 0.03998101 0.03998101 0.03998101 0.03998101 0.03998101 0.03998101
[8,] -0.38754750 -0.38754750 -0.38754750 -0.38754750 -0.38754750 -0.38754750 -0.38754750
[,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 0.42281264 0.42281264 0.42281264 0.42281264 0.42281264 0.42281264 0.42281264
[2,] -1.67678339 -1.67678339 -1.67678339 -1.67678339 -1.67678339 -1.67678339 -1.67678339
[3,] -0.49011636 -0.49011636 -0.49011636 -0.49011636 -0.49011636 -0.49011636 -0.49011636
[4,] -0.87165416 -0.87165416 -0.87165416 -0.87165416 -0.87165416 -0.87165416 -0.87165416
[5,] -1.02636022 -1.02636022 -1.02636022 -1.02636022 -1.02636022 -1.02636022 -1.02636022
[6,] 0.56549835 0.56549835 0.56549835 0.56549835 0.56549835 0.56549835 0.56549835
[7,] 0.03998101 0.03998101 0.03998101 0.03998101 0.03998101 0.03998101 0.03998101
[8,] -0.38754750 -0.38754750 -0.38754750 -0.38754750 -0.38754750 -0.38754750 -0.38754750
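As far as I understand, this works because doRNG gives every iteration its own L'Ecuyer-CMRG stream derived from the seed, so it no longer matters which worker runs which task. Below is a minimal sketch of that idea using only the parallel package; it runs sequentially and is not doRNG's actual code. The names task.seeds and run.task are made up for illustration, and sample.size is reduced to keep it fast.

```r
library(parallel)  # for nextRNGStream()

nr.simulations <- 10
sample.size <- 1000

# One stream per task, all derived from a single master seed.
RNGkind("L'Ecuyer-CMRG")
set.seed(12345)
task.seeds <- vector("list", nr.simulations)
s <- .Random.seed
for (i in seq_len(nr.simulations)) {
  task.seeds[[i]] <- s
  s <- nextRNGStream(s)
}

# Hypothetical task: install the task's own stream, then simulate.
run.task <- function(i) {
  assign(".Random.seed", task.seeds[[i]], envir = globalenv())
  tmp <- rnorm(sample.size)
  tmp[sample.size]
}

# Run the tasks in two different orders; each task still gets the same value.
in.order <- sapply(1:nr.simulations, run.task)
reversed <- sapply(nr.simulations:1, run.task)
print(identical(in.order, rev(reversed)))
```

Because the stream is tied to the task index rather than to the worker, the same body could in principle be used inside a foreach loop with any number of cores.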
Upvotes: 1
Reputation: 13591
1st Q: The following seemed to work for me
library(parallel)
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
seedlist <- c(100, 200, 300, 400, 500)
clusterExport(cl, 'seedlist')
foreach(I=1:5) %dopar% {set.seed(seedlist[I]); runif(1)}
[[1]]
[1] 0.3077661
[[2]]
[1] 0.5337724
[[3]]
[1] 0.9152467
[[4]]
[1] 0.1499731
[[5]]
[1] 0.8336
set.seed(100)
runif(1)
[1] 0.3077661
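To check that this does not depend on the number of workers, here is a sketch that runs the same seeded tasks on clusters of different sizes. It uses parSapply from the parallel package instead of foreach; run.with and seeded.draw are hypothetical helper names, and it assumes the workers use the same default RNG kind as the master session.

```r
library(parallel)

seedlist <- c(100, 200, 300, 400, 500)

# Hypothetical task: seed from the table by task index, then draw.
seeded.draw <- function(i, seeds) {
  set.seed(seeds[i])
  runif(1)
}

# Hypothetical helper: run all tasks on a cluster with n.workers workers.
run.with <- function(n.workers) {
  cl <- makeCluster(n.workers)
  on.exit(stopCluster(cl))
  parSapply(cl, seq_along(seedlist), seeded.draw, seeds = seedlist)
}

res2 <- run.with(2)
res4 <- run.with(4)
print(identical(res2, res4))
```

One caveat: seeding each task with an arbitrary integer reproduces the results, but it does not guarantee that the resulting sequences are statistically independent of each other the way dedicated parallel streams are.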
2nd Q: This seems like a bug, but maybe someone else has a better idea.
Upvotes: 1