caret: `predict` fails when `train` formula has deleted variables

Question

TL;DR ANSWER: pass the training data explicitly via the newdata argument.

How do I consistently extract class probabilities from trained models with caret's predict? Currently I get an error when the model passed to predict was trained with the formula interface and a variable was excluded with -variable.

This can be reproduced with:

library(caret)

fit.lda <- train(Species ~ . -Petal.Length,
  data = iris, 
  preProcess = c("center", "scale"), 
  trControl = trainControl(method = "repeatedcv", 
    number = 10, 
    repeats = 3, 
    classProbs = TRUE, 
    savePredictions = "final", 
    selectionFunction = "best", 
    summaryFunction = multiClassSummary), 
  method = "lda", 
  metric = "Mean_F1")

and then the following line will fail:

predict(fit.lda, type = "prob")

Error in predict.lda(modelFit, newdata) : wrong number of variables

If the -Petal.Length is omitted in the train formula, there is no error. Am I doing something wrong with the formula statement?

~~I suppose I could dig into the model's pred slot and grab the columns corresponding to the class types (see EDIT2), but this seems hackish.~~ Is there a way to get predict to work as expected?

=====EDIT=====

I trained a number of different models (using formula notation) with caretList from the caretEnsemble package, and I got various errors when trying to use predict:

knn:

Error in knn3Train(train = c(....) : dims of 'test' and 'train differ

svmRadial:

Warning message: In method$prob(modelFit = modelFit, newdata = newdata, submodels = param) : kernlab class probability calculations failed; returning NAs

mlpML:

Error in myFunc[[1]](x, ...) : number of input data columns 28 does not match number of input neurons 20

Methods that worked without errors were nnet and the tree-based methods (rf, xgbTree).

=====EDIT2=====

The following doesn't take repeated resampling into account. The selected answer is much simpler.

~~Here's a self-fashioned solution for extracting probabilities from the trained model, but for standardization, I'd prefer if it's possible to get predict to behave.~~

~~grabProbs <- function(model) model$pred[, colnames(model$pred) %in% model$levels]~~
~~grabProbs(fit.lda)~~

Sandipan Dey · Accepted Answer

Just pass the newdata argument and it will work:

predict(fit.lda, newdata = iris, type = "prob")

[EDITED]

As we can see, for lda the predictions with and without newdata are identical, but for randomForest they are not:

library(MASS)
fit.lda <- lda(Species ~ . -Petal.Length, data = iris)
identical(predict(fit.lda), predict(fit.lda, newdata=iris))
# [1] TRUE

library(randomForest)
fit.rf <- randomForest(Species ~ . -Petal.Length, data = iris)
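# Without newdata, predict() on a randomForest returns the out-of-bag
# predictions, so the two results differ.
identical(predict(fit.rf), predict(fit.rf, newdata=iris))
# [1] FALSE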
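
The "wrong number of variables" error from the question can also be reproduced without caret, which may help explain why passing newdata fixes it. The sketch below uses only base R and MASS. It rests on an assumption the answer does not confirm: that when newdata is omitted, caret hands the model stored training columns that were not filtered through the formula. The object names (model_terms, mentioned, x_formula, x_raw, msg, probs) are illustrative only.

```r
library(MASS)  # lda() and its predict method

# Expand the formula against the data; '.' becomes every other column.
tt <- terms(Species ~ . - Petal.Length, data = iris)

# Predictors that actually enter the model: Petal.Length has been removed.
model_terms <- attr(tt, "term.labels")

# Variables the formula still mentions: Petal.Length is still listed here.
mentioned <- all.vars(tt)

# Design matrix built through the formula, without the intercept column.
x_formula <- model.matrix(Species ~ . - Petal.Length, data = iris)[, -1]

# Fit lda on the formula-built matrix (3 predictors).
fit <- lda(x_formula, grouping = iris$Species)

# Predicting on all 4 numeric columns, i.e. data that never went through
# the formula, fails; capture the error message instead of stopping.
x_raw <- as.matrix(iris[, 1:4])
msg <- tryCatch({
  predict(fit, x_raw)
  "no error"
}, error = function(e) conditionMessage(e))

# Predicting on the formula-built matrix works and yields class probabilities.
probs <- predict(fit, x_formula)$posterior
```

Here model_terms has three predictors while mentioned still contains Petal.Length, and msg holds the same message as in the question. If the assumption above is right, supplying newdata makes the formula get re-applied, so the excluded column is dropped before the model sees the data.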
