sharoz

Reputation: 6345

How can I make a parallel operation faster than the serial version?

I'm attempting to "map" a function onto an array, but whether I use a simple or a complex function, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?

Simple parallel example:

library(parallel)

# Number of elements
arrayLength = 100
# Create data
input = 1:arrayLength

# A simple computation
foo = function(x, y) x^y - x^(y-1)

# Add complexity
iterations = 5 * 1000 * 1000

# Perform complex computation on each element
compute = function (x) {
  y = x
  for (i in 1:iterations) {
    x = foo(x, y)
  }
  return(x)
}

# Parallelized compute
computeParallel = function(x) {
  # Create a cluster with one fewer core than is available.
  cl <- makeCluster(detectCores() - 1) # 8-1 cores 
  # Send static vars & funcs to all cores
  clusterExport(cl, c('foo', 'iterations'))
  # Map
  out = parSapply(cl, x, compute)
  # Clean up
  stopCluster(cl)
  return(out)
}

system.time(out <- compute(input)) # 12 seconds using 25% of cpu
system.time(out <- computeParallel(input)) # 160 seconds using 100% of cpu

Upvotes: 1

Views: 505

Answers (2)

Steve Weston

Reputation: 19667

The problem is that you traded off all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement with parallelization for this kind of problem.

The pvec function in the parallel package can be a good solution to this kind of problem, but it isn't supported in parallel on Windows. A more general solution that also works on Windows is to use foreach with the itertools package, which contains functions that are useful for iterating over various objects. Here's an example that uses the "isplitVector" function to create one subvector for each worker:

library(doParallel)
library(itertools)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
computeChunk <- function(x) {
  foreach(xc=isplitVector(x, chunks=getDoParWorkers()),
          .export=c('foo', 'iterations', 'compute'), 
          .combine='c') %dopar% {
    compute(xc)
  }
}
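
For completeness, a minimal usage sketch, assuming the data and functions from the question are in scope; don't forget to stop the cluster when you're done:

# Run the chunked computation on the question's input, then shut the workers down
system.time(out <- computeChunk(input))
stopCluster(cl)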

This still may not compare very well to the pure vector version, but it should get better as the value of "iterations" increases. It may actually help to decrease the number of workers unless the value of "iterations" is very large.
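
As a sketch of that last point (the worker count of 2 below is an arbitrary assumption), you could rebuild and re-register a smaller cluster:

cl <- makeCluster(2)   # deliberately fewer workers than available cores
registerDoParallel(cl)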

Upvotes: 1

Neal Fultz

Reputation: 9687

parSapply will run the function on each element of input separately, which means you are giving up the speed you gained from writing foo and compute in a vectorized fashion.

pvec will run a vectorized function on multiple cores by splitting the input into chunks. Try this:

system.time(out <- pvec(input, compute, mc.cores=4))
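
If you're on Windows, where pvec can't run in parallel, a similar chunked approach can be sketched with the parallel package itself; clusterSplit and parLapply are standard functions, but the core count of 4 below is just an assumption:

library(parallel)
cl <- makeCluster(4)                       # core count is an assumption
clusterExport(cl, c('foo', 'iterations'))
# clusterSplit divides input into one chunk per worker, so compute
# still runs vectorized within each chunk
out <- unlist(parLapply(cl, clusterSplit(cl, input), compute))
stopCluster(cl)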

Upvotes: 0
