lui.lui

Reputation: 81

Utilize several multicore Linux servers for computation in R

I have four 32-core Linux servers (CentOS 7) that I would like to utilize for parallelized computation in R.

So far I have only been using the doMC package and registerDoMC(cores=32) to utilize the multicore capabilities of a single server. I would like to expand this to all four servers (i.e. 128 = 32 x 4 cores, if possible).

I have done some searching online, and it seems there are a number of choices: PSOCK, MPI, SNOW, SparkR, etc. Nonetheless, I could not get any of the suggestions I found online to work.

I am aware there are some prerequisites. Here is what I have done so far: 1) All servers are "connected", i.e. they can SSH to each other with passwordless login. 2) NFS is mounted so that all servers have read, write, and execute access to a shared location. 3) All servers run the same R binaries (an Anaconda build in a shared location that every server can execute). 4) Installed openmpi, Rmpi, snow, doSNOW, Spark, and SparkR (although I don't know how to use them).

Can anyone give some advice on what I can do next?

Thanks a lot

Upvotes: -1

Views: 218

Answers (1)

HenrikB

Reputation: 6815

Have a look at the future package (I'm the author). It provides an ecosystem that wraps various parallel backends in a unified API. In your particular case, with four 32-core machines to which you've already got passwordless SSH access, you can specify your 4 * 32 workers as:

library("future")

## Set up 4 * 32 workers on four machines
machines <- c("node1", "node2", "node3", "node4")
workers <- rep(machines, each = 32L)
plan(cluster, workers = workers)

If your machines don't have hostnames, you can specify their IP addresses instead.
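For example, a minimal sketch with made-up IP addresses (substitute your servers' actual ones):

## Placeholder IP addresses for the four servers
machines <- c("192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104")
workers <- rep(machines, each = 32L)
plan(cluster, workers = workers)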

Next, if you'd like to use foreach, just continue with:

library("doFuture")
registerDoFuture()

y <- foreach(i = 1:100) %dopar% {
  ...
  value
}
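As a self-contained toy variant (illustrative only), each iteration could return the name of the machine it ran on, which is a quick way to verify that work is actually being spread across all four servers:

## Illustrative only: report which machine each iteration ran on
y <- foreach(i = 1:8) %dopar% {
  Sys.info()[["nodename"]]
}
unique(unlist(y))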

If you prefer lapply, you can use future.apply as:

library("future.apply")

y <- future_lapply(1:100, FUN = function(i) {
  ...
  value
})

Technical details:
The above sets up a PSOCK cluster as defined by the 'parallel' package. These are basically the same as SNOW clusters, and they're by the same author, who I believe also considers SNOW clusters deprecated in favor of what 'parallel' provides. In other words, AFAIK there is no point in using snow/doSNOW anymore; parallel/doParallel replaces them these days.
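For reference, here is a rough sketch of what plan(cluster, workers = workers) amounts to if you were to create the cluster yourself with 'parallel' (the future package handles this for you, so this is purely illustrative):

## Manual equivalent (sketch): create the PSOCK cluster explicitly
cl <- parallel::makePSOCKcluster(workers)
plan(cluster, workers = cl)

## ... run your computations ...

## Shut down the workers when done
parallel::stopCluster(cl)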

I'd put MPI clusters under the heading of "advanced usage", i.e. unless you already have one set up and running, or unless you really think you need MPI, I would hold back on it. MPI also encourages a different algorithm design in order to take full advantage of it. PSOCK clusters take you a long way, and only if you think you've exhausted those should you look into MPI.

Spark is a whole different creature. It's designed around distributed computing on distributed data (in RAM). Your analysis might require that, but, again, I recommend that you start with the above PSOCK clusters - they take you a long way.

A final PS: if you have an HPC scheduler (it doesn't sound like you do), just use, say, plan(future.batchtools::batchtools_sge) instead. Nothing else in your code needs to change.
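That is, a minimal sketch assuming an SGE scheduler (future.batchtools provides analogous plans for other schedulers such as Slurm and Torque):

## Hypothetical SGE setup; the foreach/future_lapply code above stays unchanged
library("future.batchtools")
plan(batchtools_sge)  # each future is submitted as a scheduler job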

Upvotes: 2
