Reputation: 31
I am trying to interpret the results of the random forest model that I ran by printing Model_RF_RF. However, these results look very different from the ones I obtain by computing the confusion matrix and the accuracy myself with confusionMatrix().
Model_RF_RF <- randomForest(Label ~ ., data = train.tokens.tfidf.df, ntree = 500, mtry = 82, importance = TRUE, proximity = TRUE, trControl = cv.cntrl, nodesize = 10)
> Model_RF_RF
Call:
randomForest(formula = Label ~ ., data = train.tokens.tfidf.df, ntree = 500, mtry = 82, importance = TRUE, proximity = TRUE, trControl = cv.cntrl, nodesize = 10)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 82
OOB estimate of error rate: 44.56%
Confusion matrix:
       HIGH LOW MEDIUM class.error
HIGH     46   3     72   0.6198347
LOW       3  25     93   0.7933884
MEDIUM   22  20    194   0.1779661
> confusionMatrix(PD5, train$Label)
Confusion Matrix and Statistics
          Reference
Prediction HIGH LOW MEDIUM
    HIGH    119   0      0
    LOW       1 120      3
    MEDIUM    1   1    233
Overall Statistics
Accuracy : 0.9874
95% CI : (0.9729, 0.9954)
No Information Rate : 0.4937
P-Value [Acc > NIR] : <2e-16
Kappa : 0.98
Mcnemar's Test P-Value : 0.3916
Statistics by Class:
                     Class: HIGH Class: LOW Class: MEDIUM
Sensitivity               0.9835     0.9917        0.9873
Specificity               1.0000     0.9888        0.9917
Pos Pred Value            1.0000     0.9677        0.9915
Neg Pred Value            0.9944     0.9972        0.9877
Prevalence                0.2531     0.2531        0.4937
Detection Rate            0.2490     0.2510        0.4874
Detection Prevalence      0.2490     0.2594        0.4916
Balanced Accuracy         0.9917     0.9903        0.9895
Is there any explanation for this behavior?
Upvotes: 2
Views: 295
Reputation: 994
Welcome to Stack Overflow, Manu. The difference is that the result shown when you print Model_RF_RF is the OOB (out-of-bag) estimate, while the confusionMatrix() call at the end evaluates the model on your training set.
As you know, random forests use bagging, which means each tree is grown on a bootstrapped sample of your data. Consequently, every single record in your dataset is used by only a fraction of all the trees you grow, namely those whose bootstrap sample happened to draw it. The OOB score is obtained by predicting each record using only the trees that did NOT include it in their bootstrap sample, so every tree predicts only data it has never seen. This gives a good (often slightly pessimistic) estimate of your test error.
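You can see both numbers from a single fitted forest: predict() on a randomForest object returns the OOB predictions when called without newdata, and the in-sample predictions when you feed the training data back in. A minimal sketch on the built-in iris data (not your tf-idf frame):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

## OOB predictions: each row is classified only by trees that did NOT
## draw it in their bootstrap sample -- this is what print(rf) summarises.
oob_pred <- predict(rf)                  # no newdata => OOB predictions
mean(oob_pred == iris$Species)           # agrees with 1 - OOB error rate

## In-sample predictions: all 500 trees vote, including those that were
## trained on the row -- this is what confusionMatrix(PD5, train$Label)
## measures, and it is typically optimistic.
train_pred <- predict(rf, newdata = iris)
mean(train_pred == iris$Species)         # usually near-perfect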
It therefore looks like your training accuracy is very good, while your test accuracy is quite low (as the OOB estimate suggests). You can test your model on some held-out validation data or use cross-validation, and you should obtain a score similar to your OOB one. (As a side note, trControl is an argument of caret::train, not of randomForest(), so I believe the cv.cntrl in your call is silently ignored.)
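For example, a quick held-out check, as a sketch: the seed and 80/20 split ratio are arbitrary, and the data frame and Label column names are taken from your question:

library(randomForest)
library(caret)

set.seed(42)
n   <- nrow(train.tokens.tfidf.df)
idx <- sample(n, size = floor(0.8 * n))  # arbitrary 80/20 split
tr  <- train.tokens.tfidf.df[idx, ]
val <- train.tokens.tfidf.df[-idx, ]

## Fit on the training part only, then score the held-out part:
rf_val <- randomForest(Label ~ ., data = tr, ntree = 500, mtry = 82)
confusionMatrix(predict(rf_val, newdata = val), val$Label)

## Or cross-validate through caret::train, where trControl is actually used:
rf_cv <- train(Label ~ ., data = train.tokens.tfidf.df, method = "rf",
               ntree = 500, tuneGrid = data.frame(mtry = 82),
               trControl = trainControl(method = "cv", number = 5))
rf_cv   # resampled accuracy, comparable to 1 - OOB error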
Try changing the value of mtry, increasing the number of trees, or doing some more feature engineering; randomForest::tuneRF can search mtry for you using the OOB error directly, as sketched below.
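A minimal tuneRF sketch (the step factor and improvement threshold are arbitrary; the data frame and Label column are again taken from your question):

## Separate predictors and response for tuneRF's x/y interface:
x <- train.tokens.tfidf.df[, setdiff(names(train.tokens.tfidf.df), "Label")]
y <- train.tokens.tfidf.df$Label

set.seed(42)
tuned <- tuneRF(x, y, mtryStart = 82, ntreeTry = 500,
                stepFactor = 1.5, improve = 0.01)
tuned   # OOB error for each mtry value tried

Good luck!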
Upvotes: 3