Dave

Reputation: 2386

doSNOW/foreach Performance Issues

I have a loop that loads decent-sized files, around 5 MB each, and then runs some computations on them. I need to load 500-1000 of them. It seems like an easy job for foreach.

I am doing this but the performance of doSNOW seems to be horrendous.

I found this post and this fellow seems to have had the same issues:

http://statsadventure.blogspot.com/2012/06/performance-with-foreach-dosnow-and.html

So a couple of questions.

  1. Is there an alternative to doSNOW? I realize there is doMC, but I am running Windows.
  2. Is doMC on Linux that much faster than doSNOW?
  3. Is there any way to output to the screen from a worker, so I can at least get some idea of how my job is progressing?

Thank you in advance!

Upvotes: 1

Views: 1361

Answers (1)

cbeleites

Reputation: 14093

Multiple threads trying to access different files on the hard disk can lead to very bad performance.

However, load-balanced parallelization may still lead to an improvement if enough time goes into the calculations: the nodes will get out of synchronization, so hard disk requests will come in one after the other instead of all at the same time.

Here's a simple example of snow::clusterApply vs. the load-balanced snow::clusterApplyLB. I use snow instead of parallel because it provides timing and plotting:

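library (snow)
# Linux only: reset the CPU affinity mask so this R process may use all cores
system(sprintf('taskset -p 0xffffffff %d', Sys.getpid()))
# start two worker processes on the local machine
cl <- makeSOCKcluster (rep ("localhost", 2))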

times <- sample (1:6) / 4
times
## [1] 1.50 0.25 0.75 1.00 0.50 1.25

t <- snow.time (l <- clusterApply (cl, times, function (x) Sys.sleep (x)))
plot (t, main = "\n\nclusterApply") 
# mark the start of each task; one row of t$data[[i]] per task run on node i
for (i in 1 : 2)
  points (t$data[[i]][,"send_start"], rep (i, nrow (t$data[[i]])), pch = 20, cex = 2)

[Figure: snow.time plot for clusterApply]

tlb <- snow.time (l <- clusterApplyLB (cl, times, function (x) Sys.sleep (x)))
plot (tlb, main = "\n\nclusterApplyLB")
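# with load balancing the nodes need not get 3 tasks each, so count the rows
for (i in 1 : 2)
  points (tlb$data[[i]][,"send_start"], rep (i, nrow (tlb$data[[i]])), pch = 20, cex = 2)

stopCluster (cl)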

[Figure: snow.time plot for clusterApplyLB]

The black dots mark the start of a new function call. If the function starts with loading the file, all nodes will always try to access the hard disk at the same time with clusterApply, because the cluster waits for all nodes to return their results before dealing out the next round of tasks. With clusterApplyLB, the next task is handed out as soon as a node has returned its result. Even if the tasks take basically the same time, they will get out of synchronization rather fast, and the file loading will no longer happen at exactly the same time.
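Applied to the file-loading case, this could look roughly like the sketch below. It uses clusterApplyLB from the base parallel package, which provides the same function as snow. The temporary CSV files and the process_file helper are made up for illustration; the real loading and computation would take their place.

```r
library(parallel)

# Hypothetical stand-in data: six small CSV files in a temporary directory
demo_dir <- tempfile("lbdemo")
dir.create(demo_dir)
files <- file.path(demo_dir, sprintf("file%02d.csv", 1:6))
for (f in files) {
  write.csv(data.frame(x = 1:100), f, row.names = FALSE)
}

# Hypothetical per-file task: load the file first, then compute on it
process_file <- function(path) {
  d <- utils::read.csv(path)
  mean(d$x)
}

cl <- makeCluster(2)                            # two local worker processes
res <- clusterApplyLB(cl, files, process_file)  # next file goes to whichever worker is free
stopCluster(cl)

unlink(demo_dir, recursive = TRUE)              # clean up the demo files
```

Results come back as a list in the order of files, regardless of which worker handled which file.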

(I don't know whether this is the actual problem, though)
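One way to check, sketched here under assumptions: time the loading and the computation separately in a plain sequential run and compare the elapsed times. The .rds files and the svd-based calculation are placeholders, not the asker's actual data or code.

```r
# Hypothetical stand-in data: four small .rds files in a temporary directory
demo_dir <- tempfile("iocheck")
dir.create(demo_dir)
files <- file.path(demo_dir, sprintf("file%02d.rds", 1:4))
for (f in files) saveRDS(matrix(runif(1e4), nrow = 100), f)

# Time the two phases separately, without any parallelization
t_load <- system.time(dat <- lapply(files, readRDS))
t_calc <- system.time(res <- lapply(dat, function(m) sum(svd(m)$d)))

# Elapsed seconds per phase
print(c(load = t_load[["elapsed"]], compute = t_calc[["elapsed"]]))

unlink(demo_dir, recursive = TRUE)  # clean up the demo files
```

If loading takes most of the time, adding more workers that all read from the same disk is unlikely to help much.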

Upvotes: 3
