Hack-R

Reputation: 23214

Why does caret's "parRF" lead to tuning and missing value errors not present with "rf"?

I have a tidy dataset with no missing values and only numeric columns.

The dataset is both large and contains sensitive information, so I won't be able to provide a copy of it here, unfortunately.

I partition this data into training and testing sets with caret's createDataPartition:

idx      <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE )
training <- model_final[idx,]
testing  <- model_final[-idx,]
x        <- training[-ncol(training)]
y        <- training$y
x1       <- testing[-ncol(testing)]
y1       <- testing$y

row.names(training) <- NULL
row.names(testing)  <- NULL
row.names(x)        <- NULL
row.names(y)        <- NULL
row.names(x1)       <- NULL
row.names(y1)       <- NULL

I've been fitting and refitting Random Forest models on this data via randomForest on a regular basis:

rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
                   corr.bias = TRUE, do.trace = TRUE, nPerm = 3)

I decided to see if I could get any better or faster results with train and the following model ran fine, but took about 2 hours:

rf_train <- train(y = y, x = x,
                  method = 'rf', tuneLength = 3,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE)
                  )

I need to take an HPC approach to make this logistically feasible, so I tried running the model in parallel:

require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y=y, x=x,
               method='parRF', tuneGrid = data.frame(mtry = 3), na.action = na.omit,
               trControl=trainControl(method='cv',number=10,
                                      classProbs = TRUE, allowParallel = TRUE)
               )

but regardless of whether I use tuneLength or tuneGrid, this leads to strange errors about missing values and tuning parameters:

Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3),  : 
  final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3),  :
  missing values found in aggregated results

I say this is weird both because there were no errors with method = "rf" and because I triple-checked to ensure there are no missing values.
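A missing-value check of this kind could look like the following sketch. The names `x_demo`, `y_demo`, and `is_clean` are hypothetical stand-ins for illustration, not the real data or the exact check that was run:

```r
# Hypothetical stand-ins for the x and y objects defined above
x_demo <- data.frame(a = c(1, 2, 3), b = c(0.5, 1.5, 2.5))
y_demo <- c(10, 20, 30)

# TRUE only if the object holds no NA/NaN and no infinite values
is_clean <- function(obj) {
  vals <- unlist(obj, use.names = FALSE)
  !anyNA(vals) && all(is.finite(vals))
}

clean_x <- is_clean(x_demo)
clean_y <- is_clean(y_demo)
```

Checking for infinite values as well as NA is a deliberate choice here, since non-finite inputs can also surface later as missing values in model output.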

I even get the same errors when completely omitting tuning options. I also tried toggling the na.action option on and off and changing "cv" to "repeatedcv".

I even get the same error with this ultra-simplified version:

rf_train <- train(y=y, x=x, method='parRF')

Upvotes: 2

Views: 1085

Answers (1)

Binal Patel

Reputation: 123

This seems to be caused by a bug in caret. See the answer to:

parRF on caret not working for more than one core

I just dealt with this same issue; manually loading foreach on each new cluster seems to work.
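A minimal sketch of that workaround, assuming an explicit PSOCK cluster is used in place of `registerDoParallel(cores = 8)`. The toy data (`x_demo`, `y_demo`), the worker count of 2, and the tuning values are all hypothetical choices for illustration; this has not been run against the data in the question:

```r
library(caret)
library(doParallel)  # also attaches foreach and parallel

# Hypothetical toy regression data standing in for the real x and y
set.seed(1)
x_demo <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
y_demo <- x_demo$a + 0.5 * x_demo$b + rnorm(200, sd = 0.1)

# Create an explicit cluster so each worker can be prepared by hand
cl <- makePSOCKcluster(2)
clusterEvalQ(cl, library(foreach))  # the workaround: load foreach on every worker
registerDoParallel(cl)

rf_train <- train(x = x_demo, y = y_demo,
                  method = "parRF",
                  tuneGrid = data.frame(mtry = 2),
                  trControl = trainControl(method = "cv", number = 5,
                                           allowParallel = TRUE))

# Release the workers and fall back to sequential execution
stopCluster(cl)
registerDoSEQ()
```

Creating the cluster object explicitly is what makes `clusterEvalQ` possible; with `registerDoParallel(cores = 8)` there is no cluster handle to pass to it.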

Upvotes: 2
