Carbon

Reputation: 3943

How do I parallelize in R on Windows - example?

How do I get parallelization of code to work in R on Windows? Include a simple example. I'm posting this self-answered question because this was rather unpleasant to get working. You'll find that package parallel does NOT work on its own, but package snow works very well.

Upvotes: 38

Views: 43522

Answers (6)

Mike Thompson

Reputation: 11

For what it's worth: I was running into the same problem but couldn't get any of these to work. I eventually learned that RStudio has a 'Jobs' pane and can run models in the background, each on its own core. So what I did was divvy up my model into 10 segments (it iterated over 100 vectors, so 10 scripts of 10 vectors each) and ran each as a separate job. That way, when one finished I could use its output immediately, and I could keep working on my script without waiting for every model to finish. Here is a link all about using jobs: https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/
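One way to launch those segment scripts programmatically is through the rstudioapi package (a minimal sketch, assuming RStudio 1.2+; the segment file names are hypothetical):

library(rstudioapi)

# start each segment script as its own background job
# (segment_01.R ... segment_10.R are placeholder file names)
for (script in sprintf("segment_%02d.R", 1:10)) {
  jobRunScript(script, name = script)
}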

Upvotes: 1

SummerEla

Reputation: 1952

I'm posting a cross-platform answer here because all the other answers I found were over-complicated for what I needed to accomplish. I'm using an example where I read in all the sheets of an Excel workbook.

# read in all sheets of the spreadsheet
parallel_read <- function(file){

    # detect available cores and use 70%
    numCores <- round(parallel::detectCores() * .70)

    # check if OS is Windows and use parLapply on a PSOCK cluster
    if(.Platform$OS.type == "windows") {

        cl <- parallel::makePSOCKcluster(numCores)
        dfs <- parallel::parLapply(cl, readxl::excel_sheets(file),
                                   readxl::read_excel, path = file)
        parallel::stopCluster(cl)

        return(dfs)

    # if not Windows, use mclapply (forking)
    } else {

        dfs <- parallel::mclapply(readxl::excel_sheets(file),
                                  readxl::read_excel,
                                  path = file,
                                  mc.cores = numCores)

        return(dfs)

    }
}
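
Usage is then a single call (the file name is just a placeholder):

dfs <- parallel_read("my_workbook.xlsx")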

Upvotes: 3

Carbon

Reputation: 3943

Posting this because this took me bloody forever to figure out. Here's a simple example of parallelization in R that will let you test whether things are working right for you and get you on the right path.

library(snow)

# a trivial workload: four tasks that each sleep for one second
z <- 1:4

# sequential baseline: takes about 4 seconds
system.time(lapply(z, function(x) Sys.sleep(1)))

# parallel version: takes about 1 second with 4 or more cores
cl <- makeCluster(### YOUR NUMBER OF CORES GOES HERE ###, type = "SOCK")
system.time(clusterApply(cl, z, function(x) Sys.sleep(1)))
stopCluster(cl)

You should also use the doSNOW package to register foreach with the snow cluster; this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()), and the command that undoes the registration is registerDoSEQ(). Don't forget to turn off your clusters.
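
As a minimal sketch of what that registration looks like in practice (the cluster size and the toy loop body here are just illustrative assumptions):

library(doSNOW)
library(foreach)

cl <- makeCluster(4, type = "SOCK")   # assumes 4 cores; adjust for your machine
registerDoSNOW(cl)

# foreach now sends the iterations to the snow cluster
result <- foreach(i = 1:4, .combine = c) %dopar% {
  Sys.sleep(1)
  i^2
}

registerDoSEQ()   # undo the registration, back to sequential execution
stopCluster(cl)   # turn off the cluster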

Upvotes: 50

Sander van den Oord

Reputation: 12808

This worked for me; I used the doParallel package, and it required 3 lines of code:

# process in parallel
library(doParallel) 
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)

# turn parallel processing off and run sequentially again:
registerDoSEQ()

Calculation of a random forest decreased from 180 secs to 120 secs (on a Windows computer with 4 cores).
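
The snippet above doesn't show the random forest call itself, but a common pattern with a registered doParallel backend is to grow the trees across workers with foreach. Here is a sketch under assumed names: the randomForest package, a data frame my_data, and a response column y are all illustrative.

library(doParallel)
library(foreach)

cl <- makeCluster(detectCores(), type = 'PSOCK')
registerDoParallel(cl)

# grow 125 trees on each of 4 workers and merge them into one forest
# (my_data and the formula y ~ . are placeholders)
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
    randomForest::randomForest(y ~ ., data = my_data, ntree = ntree)
}

registerDoSEQ()
stopCluster(cl)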

Upvotes: 18

Mark

Reputation: 113

Based on the information here I was able to convert the following code into a parallelised version that worked under RStudio on Windows 7.

Original code:

#
# Basic elbow plot function
#
wssplot <- function(data, nc=20, seed=1234){
    wss <- (nrow(data)-1)*sum(apply(data,2,var))
    for (i in 2:nc){
        set.seed(seed)
        wss[i] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)}
    plot(1:nc, wss, type="b", xlab="Number of clusters", 
       ylab="Within groups sum of squares")
}

Parallelised code:

library("parallel")

workerFunc <- function(nc) {
  set.seed(1234)
  return(sum(kmeans(my_data_frame, centers=nc, iter.max=30)$withinss)) }

num_cores <- detectCores()
cl <- makeCluster(num_cores)
clusterExport(cl, varlist=c("my_data_frame")) 
values <- 1:20 # this represents the "nc" variable in the wssplot function
system.time(
  result <- parLapply(cl, values, workerFunc))  # parallel execution, with time wrapper
stopCluster(cl)
plot(values, unlist(result), type="b", xlab="Number of clusters", ylab="Within groups sum of squares")

I'm not suggesting it's perfect or even the best approach; I'm just a beginner demonstrating that parallel does seem to work under Windows. Hope it helps.

Upvotes: 5

rischan

Reputation: 1585

I think these libraries will help you:

foreach (facilitates executing the loop in parallel)
doSNOW (I think you already use it)
doMC (multicore functionality of the parallel package; note that it relies on forking, so it won't run in parallel on Windows)

These articles may also help you:

http://vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/

http://www.joyofdata.de/blog/parallel-computing-r-windows-using-dosnow-foreach/

Upvotes: 4
