Reputation: 2386
I have a loop which loads decent-sized files, around 5MB each, and then runs some computations on them. I need to load 500-1000 of them. Seems like an easy job for foreach.
I am doing this, but the performance of doSNOW seems to be horrendous.
I found this post and this fellow seems to have had the same issues:
http://statsadventure.blogspot.com/2012/06/performance-with-foreach-dosnow-and.html
So a couple of questions.
Thank you in advance!
Upvotes: 1
Views: 1361
Reputation: 14093
Multiple threads trying to access different files on the hard disk can lead to very bad performance.
However, load-balanced parallelization may still bring an improvement if enough time goes into the calculations: the nodes drift out of synchronization, so hard disk requests arrive one after the other instead of all at the same time.
Here's a simple example of snow::clusterApply vs. the load-balanced snow::clusterApplyLB. I use snow instead of parallel as it provides timing and plotting:
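library (snow)
# Linux only: reset the CPU affinity mask of this R process to all cores
system (sprintf ('taskset -p 0xffffffff %d', Sys.getpid ()))
cl <- makeSOCKcluster (rep ("localhost", 2))

times <- sample (1:6) / 4
times
## [1] 1.50 0.25 0.75 1.00 0.50 1.25

# clusterApply: tasks are dealt out in rounds
t <- snow.time (l <- clusterApply (cl, times, function (x) Sys.sleep (x)))
plot (t, main = "\n\nclusterApply")
for (i in 1 : 2) {
  starts <- t$data[[i]][, "send_start"]
  # one dot per task; with load balancing a node need not get exactly 3 tasks
  points (starts, rep (i, length (starts)), pch = 20, cex = 2)
}

# clusterApplyLB: a node gets its next task as soon as it is free
tlb <- snow.time (l <- clusterApplyLB (cl, times, function (x) Sys.sleep (x)))
plot (tlb, main = "\n\nclusterApplyLB")
for (i in 1 : 2) {
  starts <- tlb$data[[i]][, "send_start"]
  points (starts, rep (i, length (starts)), pch = 20, cex = 2)
}
stopCluster (cl)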
The black dots mark the start of a new function call. If the function starts with loading the file, all nodes will always try to access the hard disk at the same time with clusterApply, because the cluster waits for all nodes to return their results before dealing out the next round of tasks. With clusterApplyLB, the next task is handed out as soon as a node has returned its result. Even if the tasks take basically the same time, the nodes will get out of synchronization rather fast, and the file loads will no longer happen at exactly the same time.
(I don't know whether this is the actual problem, though)
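If you want to see the scheduling difference without starting a cluster, here is a rough simulation of when each task would be handed out under the two strategies. It is only a sketch: simulate_dispatch is a hypothetical helper I made up for illustration (it is not part of snow or parallel), it ignores all communication overhead, and it assumes the round-based behaviour of clusterApply described above. The task durations are the ones from the example run.

```r
# simulate_dispatch() is a hypothetical helper, not a snow/parallel function.
# It returns the time at which each task would be handed to a node,
# assuming zero communication overhead.
simulate_dispatch <- function(times, n_nodes, load_balanced) {
  start <- numeric(length(times))  # dispatch time of each task
  if (load_balanced) {
    # clusterApplyLB-style: the next task goes to whichever node is free first
    free_at <- numeric(n_nodes)    # time at which each node becomes free
    for (k in seq_along(times)) {
      node <- which.min(free_at)
      start[k] <- free_at[node]
      free_at[node] <- start[k] + times[k]
    }
  } else {
    # clusterApply-style: tasks go out in rounds of n_nodes, and a new round
    # starts only after the slowest task of the previous round has returned
    round_start <- 0
    for (first in seq(1, length(times), by = n_nodes)) {
      idx <- first:min(first + n_nodes - 1, length(times))
      start[idx] <- round_start
      round_start <- round_start + max(times[idx])
    }
  }
  start
}

times <- c(1.50, 0.25, 0.75, 1.00, 0.50, 1.25)
rounds   <- simulate_dispatch(times, n_nodes = 2, load_balanced = FALSE)
balanced <- simulate_dispatch(times, n_nodes = 2, load_balanced = TRUE)

rounds
## [1] 0.0 0.0 1.5 1.5 2.5 2.5
balanced
## [1] 0.00 0.00 0.25 1.00 1.50 2.00
```

In the round-based case every dispatch time occurs twice, i.e. both nodes would always hit the disk together. In the load-balanced case only the very first pair of tasks starts simultaneously.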
Upvotes: 3