Reputation: 67
I have run a random forest prediction model. When I run the code below I get two different confusion matrices. The only difference is that one call passes data = train to the predict function and the other passes just train. Why would this make such a difference? The recall on one is a lot worse.
conf.matrix <- table(train$Status,predict(fit2,train))
Pred:Churn Pred:Current
Actual:Churn 2543 984
Actual:Current 44 27206
conf.matrix <- table(train$Status,predict(fit2,data = train))
Pred:Churn Pred:Current
Actual:Churn 1609 1918
Actual:Current 464 26786
Many thanks.
Upvotes: 1
Views: 223
Reputation: 57696
The data argument in your 2nd example is ignored, because the correct argument name is newdata, as @mtoto and @agenis noted. In the absence of newdata, predict.randomForest will return the out-of-bag predictions for the model.
The out-of-bag predictions are what you want here.
From my post on CrossValidated:
Be aware that there's a difference between
predict(model)
and
predict(model, newdata=train)
when getting predictions for the training dataset. The first option gets the out-of-bag predictions from the random forest. This is generally what you want, when comparing predicted values to actuals on the training data.
The second treats your training data as if it was a new dataset, and runs the observations down each tree. This will result in an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees, relying instead on the ensemble of trees to control overfitting. So don't do this if you want to get predictions on the training data.
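To see the three calls side by side, here is a minimal sketch. It assumes the randomForest package is installed and uses the built-in iris data as a stand-in for the train data in the question; the object names (fit, oob, ignored, resub) and ntree = 200 are illustrative choices, not taken from the original post.

```r
library(randomForest)

set.seed(42)  # the forest is random, so fix the seed for repeatability

# iris stands in for the question's `train` data (an assumption for illustration)
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

# No newdata supplied: returns the out-of-bag predictions stored in the model
oob <- predict(fit)

# `data` is not an argument of predict.randomForest, so it is swallowed by `...`
# and newdata is still missing: this is the out-of-bag prediction again
ignored <- predict(fit, data = iris)

# newdata supplied: every training row is run down every tree, including the
# trees that were grown on that row, so the fit looks better than it really is
resub <- predict(fit, newdata = iris)

# The misnamed argument changes nothing relative to predict(fit)
print(identical(oob, ignored))

# Compare the two confusion matrices
print(table(Actual = iris$Species, Predicted = oob))
print(table(Actual = iris$Species, Predicted = resub))
```

The same pattern should explain the two matrices in the question: the call with data = train reports out-of-bag performance, while the call that passes train positionally fills newdata and reports the optimistic resubstitution result.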
Upvotes: 1