armen

Reputation: 443

Parallel training in the caret package of R when using a custom sampler

I'm trying to pass my own sampler to the train function of the caret package (because of imbalanced data) and then train the model in parallel. Without the sampler, training works fine. With the sampler but without parallel execution, it also works fine. But when I run it in parallel with the sampler, I get an error. I have tried this on two different systems; both fail, but with different errors. Here is an example:

library(caret)
set.seed(1)
data(iris)

library(DMwR)
library(doParallel)
cl <- makeCluster(3)
# cl <- makeCluster(1) # uncommenting this will make the code work
print(cl)
registerDoParallel(cl)

smote_wrapper <- list(
        name = "custom_smoting",
        func = function(x, y) {
                #print(dim(x))
                print(length(y))
                data <- cbind(x, data.frame(Class = y))
                #print(table(data$Class))
                print("calling smote")
                final <- SMOTE(Class~., data, perc.over = 50, perc.under = 50)
                print("smote over")
                #print(dim(final))
                final$Class <- as.factor(final$Class)
                print(table(final$Class))
                class_index <- which(colnames(final) == "Class")
                print(paste("dim:", dim(final)))
                result <- list(x = final[,-class_index], y = final$Class)
                result
        },
        first = FALSE
)
data(iris)
control <- trainControl(sampling = smote_wrapper)
model <- train(Species~., iris, method = "svmLinear2", trControl = control)
stopCluster(cl)

On one system it stops training the model and gives the error:

Error in { : task 1 failed - "object 'out2' not found"

And on the other system it gives:

Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :3     NA's   :3    
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Could it be that a custom sampler doesn't work in parallel?

I was using the latest CRAN release of caret (6.0.77), but because of another error ("optimismBoot not found") I had to install the latest version from GitHub (devtools::install_github).

Upvotes: 1


Answers (1)

CPak

Reputation: 13581

It looks like you might need to export your packages and variables to the cluster:

registerDoParallel(cl)
# try these lines
clusterEvalQ(cl, { library(DMwR) })
clusterExport(cl, "smote_wrapper")

In parallel mode, each worker is a separate R process with its own environment. caret will look in each worker's environment for packages and variables, so anything you don't load or export there will not be available. Hope this helps.
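To see what exporting does on its own, here is a minimal sketch that uses only the base parallel package and no caret. The object my_sampler is a hypothetical stand-in for smote_wrapper, and library(stats) merely stands in for library(DMwR); neither comes from the question. It checks whether the workers can see the object before and after clusterExport:

```r
library(parallel)

# Hypothetical stand-in for the custom sampler defined in the main session.
my_sampler <- list(
  name = "demo_sampler",
  func = function(x, y) list(x = x, y = y),
  first = FALSE
)

cl <- makeCluster(2)

# Ask each worker whether it can see the object. On a fresh cluster the
# workers start with their own empty global environments.
before <- unlist(clusterEvalQ(cl, exists("my_sampler")))

# Copy the object from this session to every worker, and load a package
# on each worker (stats is used here only as a placeholder for DMwR).
clusterExport(cl, "my_sampler", envir = environment())
invisible(clusterEvalQ(cl, library(stats)))

after <- unlist(clusterEvalQ(cl, exists("my_sampler")))

stopCluster(cl)

print(before)  # one value per worker, checked before the export
print(after)   # one value per worker, checked after the export
```

If the second vector is all TRUE while the first is all FALSE, the workers only learned about the object through the explicit export, which is the situation the two lines above are meant to fix.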

Upvotes: 3
