Reputation: 23214
I have a tidy dataset with no missing values and only numeric columns.
The dataset is both large and contains sensitive information, so I won't be able to provide a copy of it here, unfortunately.
I partition this data into training and testing sets with caret's createDataPartition:
idx <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE )
training <- model_final[idx,]
testing <- model_final[-idx,]
x <- training[-ncol(training)]
y <- training$y
x1 <- testing[-ncol(testing)]
y1 <- testing$y
row.names(training) <- NULL
row.names(testing) <- NULL
row.names(x) <- NULL
row.names(y) <- NULL
row.names(x1) <- NULL
row.names(y1) <- NULL
I've been fitting and refitting Random Forest models on this data via randomForest on a regular basis:
rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
corr.bias = T, do.trace = T, nPerm = 3)
I decided to see if I could get any better or faster results with train, and the following model ran fine but took about 2 hours:
rf_train <- train(y=y, x=x,
method='rf', tuneLength = 3,
trControl=trainControl(method='cv',number=10,
classProbs = TRUE)
)
I need to take an HPC approach to make this logistically feasible, so I tried:
require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y=y, x=x,
method='parRF', tuneGrid = data.frame(mtry = 3), na.action = na.omit,
trControl=trainControl(method='cv',number=10,
classProbs = TRUE, allowParallel = TRUE)
)
but regardless of whether I use tuneLength or tuneGrid, this leads to strange errors about missing values and tuning parameters:
Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
missing values found in aggregated results
I say this is weird both because there were no errors with method = "rf" and because I triple-checked to ensure there are no missing values.
I even get the same errors when completely omitting tuning options. I also tried toggling the na.action option on and off and changing "cv" to "repeatedcv".
I even get the same error with this ultra-simplified version:
rf_train <- train(y=y, x=x, method='parRF')
Upvotes: 2
Views: 1085
Reputation: 123
This seems to be caused by a bug in caret. See the answer to:
parRF on caret not working for more than one core
I just dealt with this same issue; manually loading foreach on each worker of the new cluster seems to work.
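A minimal sketch of what that workaround can look like, assuming a PSOCK cluster created by hand rather than via registerDoParallel(cores = 8). The toy x and y, the worker count, and the mtry value are hypothetical stand-ins, not taken from the question; substitute your own data and settings.

```r
library(caret)
library(doParallel)  # also attaches foreach and parallel on the master

# Hypothetical stand-in data; replace with your own x and y.
set.seed(1)
x <- data.frame(a = rnorm(60), b = rnorm(60), c = rnorm(60))
y <- x$a + rnorm(60)

# Create the cluster explicitly so each worker can be prepared by hand.
cl <- makePSOCKcluster(2)           # assumed worker count; adjust to your machine
clusterEvalQ(cl, library(foreach))  # the manual step: load foreach on every worker
registerDoParallel(cl)

rf_train <- train(x = x, y = y, method = "parRF",
                  tuneGrid = data.frame(mtry = 2),
                  trControl = trainControl(method = "cv", number = 5,
                                           allowParallel = TRUE))

# Release the workers when finished.
stopCluster(cl)
```

The only change relative to the failing setup is that the cluster object is kept, so clusterEvalQ can attach foreach on each worker before train runs.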
Upvotes: 2