me_overfolwn

Reputation: 64

Using parallel in R for whole scripts

I have a question regarding parallel computation of whole scripts. My script imports data, then randomly splits it into a training and a validation data frame, does some preprocessing, and runs the validation. I want to run the same script with many different seeds.

Is it possible to do this in parallel? The scripts don't interfere with each other.

seeds <- c(2343242,324256,764865,3524526,574574,75624,15436,674767,4325265,2462626,
           245264,647474,2465374,4253532,5787462,35636,357484,34524,74859,1352637)

for (i in 1:length(seeds)) {
  set.seed(seeds[i])
  seed <- seeds[i]
  print(seeds[i])

  print("begin import")
  source(file = "import.r")
  print("preprocessing")
  source(file = "preProc.r")
  print("normal")
  source(file = "algorithms and datasets.r")
  print("resampled")
  source(file = "algorithms and datasets up down.r")
}

Upvotes: 1

Views: 311

Answers (2)

Guillem Pocull

Reputation: 25

RStudio has a very reliable, easy, and intuitive way of running scripts in parallel called Background Jobs. The following link explains how to use it, but in summary: every time you run a script as a background job, it runs in parallel in its own R session, using another core, and it usually finishes much faster (as long as the CPU and RAM are not already busy). There are two ways to use Background Jobs, the manual and the scripted:

  1. The manual way: you just open the Background Jobs pane and select the script and the working directory. Then you can choose whether or not to copy the global environment into the job. If your script already exports or saves its objects locally, you don't need to worry about that.

  2. The scripted way: you can use rstudioapi::jobRunScript() to create background jobs from code, so you can automate the changes you want across the scripts; see the sketch below.
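To illustrate the scripted way, here is a minimal sketch. It assumes the rstudioapi package is installed and the code is run from within RStudio; "run_one_seed.r" is a hypothetical wrapper script that sets the seed and sources your four scripts, and the exact arguments of jobRunScript() may differ in your version:

library(rstudioapi)

seeds <- c(2343242, 324256, 764865)

for (seed in seeds) {
  # Each call launches one background job, i.e. a separate R session
  # running "run_one_seed.r" (a hypothetical wrapper around your scripts).
  jobRunScript(
    path       = "run_one_seed.r",
    name       = paste0("run with seed ", seed),
    workingDir = getwd(),
    importEnv  = TRUE  # copy the global environment (including `seed`) into the job
  )
}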

Upvotes: 2

HenrikB

Reputation: 6815

Verbatim one-to-one solution:

library(future.apply)
plan(multisession)

seeds <- c(2343242,324256,764865,3524526,574574,75624,15436,674767,4325265,2462626,
           245264,647474,2465374,4253532,5787462,35636,357484,34524,74859,1352637)

empty <- future_lapply(seeds, function(seed) {
  set.seed(seed)
  print(seed)
  print("begin import")
  source(file = "import.r")
  print("preprocessing")
  source(file = "preProc.r")
  print("normal")
  source(file = "algorithms and datasets.r")
  print("resampled")
  source(file = "algorithms and datasets up down.r")
})

Unless those seeds you've picked are essential in some way, you probably want to use a statistically sound parallel RNG instead, which you get automatically if you do:

library(future.apply)
plan(multisession)

set.seed(42) ## Optional to fix the initial seed
n <- 20L     ## Number of runs

empty <- future_lapply(1:n, function(ii) {
  print(.Random.seed)
  print("begin import")
  source(file = "import.r")
  print("preprocessing")
  source(file = "preProc.r")
  print("normal")
  source(file = "algorithms and datasets.r")
  print("resampled")
  source(file = "algorithms and datasets up down.r")
}, future.seed = TRUE)

Since we're not making use of ii here, the latter could equally well be written using the futurized version of base::replicate():

library(future.apply)
plan(multisession)

set.seed(42) ## Optional to fix the initial seed
n <- 20L     ## Number of runs

empty <- future_replicate(n, {
  print(.Random.seed)
  print("begin import")
  source(file = "import.r")
  print("preprocessing")
  source(file = "preProc.r")
  print("normal")
  source(file = "algorithms and datasets.r")
  print("resampled")
  source(file = "algorithms and datasets up down.r")
})

PS. It's not clear to me how you distinguish the results from the different runs. Maybe you rely on the seed to save to different files inside those scripts.
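For instance, here is a minimal sketch of one way to keep the runs apart, assuming each run's sourced scripts leave their final output in an object called `fit` (a hypothetical name; substitute whatever your scripts actually create). Each run returns that object and also writes it to a seed-specific file:

library(future.apply)
plan(multisession)

results <- future_lapply(seeds, function(seed) {
  set.seed(seed)  # reproduce your per-run seeds; future.seed below declares RNG use
  # local = TRUE makes the sourced scripts assign into this function's
  # environment, so objects they create (e.g. `fit`) are visible here
  source(file = "import.r", local = TRUE)
  source(file = "preProc.r", local = TRUE)
  source(file = "algorithms and datasets.r", local = TRUE)
  source(file = "algorithms and datasets up down.r", local = TRUE)
  # also write a per-seed file so partial results survive a crash
  saveRDS(fit, file = paste0("fit_seed_", seed, ".rds"))
  fit
}, future.seed = TRUE)
names(results) <- as.character(seeds)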

Upvotes: 2
