Parallel processing using "parallel" in R

I have a data table that I am loading into R (.csv format with multiple columns), to which certain user-defined rules need to be applied (another file in .csv format, where each row defines a rule).

Currently I am using a for loop to iterate through the rules and apply them to the dataset, which makes them run sequentially. However, the output of each rule is independent of the others, so I want to run them in parallel.
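For reference, the sequential setup can be sketched like this; the dataset columns, the rule structure, and the `applyRules()` body are all hypothetical stand-ins, since the real files aren't shown:

```r
# Toy stand-ins for the question's objects (structure is assumed)
sampleDataset <- data.frame(id = 1:4, value = c(10, 20, 30, 40))

# Each "rule" is a row: a column to test and a threshold to filter by
sampleRules <- data.frame(column = c("value", "value"),
                          threshold = c(15, 35),
                          stringsAsFactors = FALSE)

# Hypothetical stand-in for applyRules(): apply one rule to the dataset
applyRules <- function(dataset, rule, mode = "union") {
  dataset[dataset[[rule$column]] > rule$threshold, , drop = FALSE]
}

# The sequential for loop: one result per rule, each independent of the others
results <- vector("list", nrow(sampleRules))
for (i in seq_len(nrow(sampleRules))) {
  results[[i]] <- applyRules(sampleDataset, sampleRules[i, ], "union")
}
```

Because each iteration only reads `sampleDataset` and one rule, the iterations have no shared state and are safe to parallelize.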

I explored the parallel package in R but couldn't narrow down what I need. My setup is as follows -

dataset is my data table
rulesSet is my rules file

^ These are being read in an .Rmd file

applyRules(dataset, rulesSet) is the function that takes the above two as parameters and returns the resultant data

^ This function lives in a separate util.R file but is called from the .Rmd

Each row of rulesSet needs to be applied to dataset, returning a resultant data table. I tried writing -

clusterApplyLB(cl=clust,
               sampleRules,
               fun=function(x){
                   applyRules(sampleDataset, x, "union")
               })

and also tried parLapply/parSapply with the same format, but in vain (I get an error saying could not find function "applyRules").

Could someone tell me where I'm going wrong?

Upvotes: 0

Views: 429

Answers (2)

HenrikB

Reputation: 6815

Author of the future framework here. If you can get the following sequential call to work:

res <- lapply(sampleRules, FUN = function(x) {
  applyRules(sampleDataset, x, "union")
})

then you might have better success with:

library(future.apply)
plan(multisession)

res <- future_lapply(sampleRules, FUN = function(x) {
  applyRules(sampleDataset, x, "union")
})

because future_lapply tries to identify all packages and global objects that the function depends on and automatically exports them to the workers.
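As a concrete sketch of that approach (using toy stand-ins for sampleDataset, the rules, and applyRules(), since the real objects aren't shown; the rules are held in a list so the apply function iterates over rules rather than data-frame columns):

```r
library(future.apply)
plan(multisession, workers = 2)  # parallel workers in background R sessions

# Toy stand-ins for the question's objects (structure is assumed)
sampleDataset <- data.frame(id = 1:4, value = c(10, 20, 30, 40))
ruleList <- list(list(column = "value", threshold = 15),
                 list(column = "value", threshold = 35))
applyRules <- function(dataset, rule, mode = "union") {
  dataset[dataset[[rule$column]] > rule$threshold, , drop = FALSE]
}

# future_lapply detects that the function uses applyRules and sampleDataset
# and exports them to the workers automatically -- no clusterExport needed
res <- future_lapply(ruleList, FUN = function(x) {
  applyRules(sampleDataset, x, "union")
})

plan(sequential)  # shut down the background workers
```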

Upvotes: 1

divibisan

Reputation: 12165

When you make a cluster using parallel or snow, it spawns a number of nodes which are actually separate rsession processes (you can check this by looking in Activity Monitor, Task Manager, or top while they're running). Since they're separate R sessions, they each have their own environment and cannot see objects loaded in your main R environment. You need to use the clusterExport function to export any objects the nodes need into their environments before you run clusterApply.

Now, in your case, this error is strange, because you can pass objects and functions into the nodes through the clusterApplyLB function. However, the error you're getting tells me that, for whatever reason, one of your nodes is trying to call applyRules from an environment where it's not available. Try exporting your function (and possibly your datasets as well, if necessary) to the cluster as below and see if that solves your problem.

library(parallel)

cl <- makeCluster(4)
clusterExport(cl, 'applyRules')
results <- clusterApplyLB(cl,
                          sampleRules,
                          fun = function(x) {
                              applyRules(sampleDataset, x, "union")
                          })
stopCluster(cl)
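Putting that together into a self-contained, runnable sketch (again with hypothetical stand-ins for the dataset, rules, and applyRules(), and with the rules in a list so clusterApplyLB iterates over rules rather than data-frame columns):

```r
library(parallel)

# Toy stand-ins for the question's objects (structure is assumed)
sampleDataset <- data.frame(id = 1:4, value = c(10, 20, 30, 40))
ruleList <- list(list(column = "value", threshold = 15),
                 list(column = "value", threshold = 35))
applyRules <- function(dataset, rule, mode = "union") {
  dataset[dataset[[rule$column]] > rule$threshold, , drop = FALSE]
}

cl <- makeCluster(2)
# Workers are fresh R sessions: export everything the worker function touches
clusterExport(cl, c("applyRules", "sampleDataset"))
results <- clusterApplyLB(cl, ruleList, fun = function(x) {
  applyRules(sampleDataset, x, "union")
})
stopCluster(cl)
```

Without the clusterExport call, each worker would fail with the same "could not find function applyRules" error from the question, because the function only exists in the main session's environment.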

Upvotes: 1
