Confusing confusion matrix parameters changing output

Question

I have fitted a random forest prediction model. When I run the code below I get two different confusion matrices. The only difference is that in one call I pass data = train to the predict function, and in the other I just pass train. Why would this make such a difference? The recall on one is a lot worse.

conf.matrix <- table(train$Status,predict(fit2,train))

               Pred:Churn Pred:Current
  Actual:Churn         2543          984
  Actual:Current         44        27206

conf.matrix <- table(train$Status,predict(fit2,data = train))

                Pred:Churn Pred:Current
  Actual:Churn         1609         1918
  Actual:Current        464        26786

Many thanks.

Hong Ooi · Accepted Answer

The data argument in your second example is ignored, because the correct argument name is newdata, as @mtoto and @agenis noted. In the absence of newdata, predict.randomForest returns the out-of-bag predictions for the model.

The out-of-bag predictions are what you want here.
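A minimal sketch of the argument matching, assuming the randomForest package is installed. It uses iris as a stand-in for the asker's data, and the object names (fit, oob, misnamed, resub, positional) are made up for illustration:

```r
library(randomForest)

set.seed(42)
# Stand-in data, not the asker's: iris with Species as the response
train <- iris
fit <- randomForest(Species ~ ., data = train, ntree = 200)

oob        <- predict(fit)                   # no newdata: out-of-bag predictions
misnamed   <- predict(fit, data = train)     # 'data' matches no formal argument, so it falls into '...'
resub      <- predict(fit, newdata = train)  # training rows treated as new data
positional <- predict(fit, train)            # second positional argument is matched to newdata

identical(oob, misnamed)      # the data = train call behaves like predict(fit)
identical(resub, positional)  # the bare train call behaves like newdata = train
```

So the first call in the question, predict(fit2, train), is the newdata case, and the second, predict(fit2, data = train), is the out-of-bag case.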

From my post on CrossValidated:

Be aware that there's a difference between
predict(model)
and
predict(model, newdata=train)
when getting predictions for the training dataset. The first option gets the out-of-bag predictions from the random forest. This is generally what you want, when comparing predicted values to actuals on the training data.

The second treats your training data as if it were a new dataset, and runs the observations down each tree. This will result in an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees, relying instead on the ensemble of trees to control overfitting. So don't do this if you want to get predictions on the training data.
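To see how large the gap can be, here is a rough sketch on synthetic data. The data frame, the noise level and the variable names are all invented for illustration and are not the asker's data. The class labels are deliberately noisy so that the model cannot fit them honestly:

```r
library(randomForest)

set.seed(1)
n <- 500
# Made-up two-class problem with label noise (hypothetical, for illustration only)
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$Status <- factor(ifelse(train$x1 + rnorm(n) > 0, "Churn", "Current"))

fit <- randomForest(Status ~ x1 + x2, data = train, ntree = 200)

# Out-of-bag: each row is scored only by trees whose bootstrap sample left it out
oob_pred <- predict(fit)
# Resubstitution: each row is run down every tree, including trees grown on it
resub_pred <- predict(fit, newdata = train)

print(table(Actual = train$Status, Predicted = oob_pred))
print(table(Actual = train$Status, Predicted = resub_pred))

oob_acc   <- mean(oob_pred == train$Status)
resub_acc <- mean(resub_pred == train$Status)
cat("OOB accuracy:", oob_acc, " resubstitution accuracy:", resub_acc, "\n")
```

The resubstitution table should look much cleaner than the out-of-bag one, which is the same pattern as the two matrices in the question.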
