Reputation: 307
In Weka, how do I check whether an induced tree overfits the training data?
EDIT:
So now these are the results of my Random Forest classifier, built on a large training set and evaluated on a much smaller validation set (generated dynamically to match the class ratio of the large training set).
You said that if there is overfitting, performance on the test set (which I call the validation set) would drop terribly. But in this case it doesn't seem to drop much.
Large training set (25000 records)
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 24849 99.3563 %
Incorrectly Classified Instances 161 0.6437 %
Kappa statistic 0.9886
Mean absolute error 0.0344
Root mean squared error 0.0887
Relative absolute error 30.31 %
Root relative squared error 37.2327 %
Total Number of Instances 25010
Validation set (IID?) (5000 records)
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 4951 99.02 %
Incorrectly Classified Instances 49 0.98 %
Kappa statistic 0.9827
Mean absolute error 0.0402
Root mean squared error 0.0999
Relative absolute error 35.269 %
Root relative squared error 41.8963 %
Total Number of Instances 5000
Upvotes: 2
Views: 5273
Reputation: 21
If I am not wrong, the accuracy results shown above come from evaluating your classifier on the complete (training) dataset; they say nothing about how it classifies unseen test data. To get a meaningful accuracy estimate, you need to work with a train/test split, or with an external test set; this will give you a much better idea of the classifier's real performance.
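For example, here is a minimal sketch of such a split using the Weka Java API (the file name, random seed, and 80/20 ratio are just placeholders):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.Random;

    public class SplitEvaluation {
        public static void main(String[] args) throws Exception {
            // Load the full dataset (file name is a placeholder)
            Instances data = DataSource.read("data.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Shuffle, then hold out 20% as an external test split
            data.randomize(new Random(42));
            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            RandomForest rf = new RandomForest();
            rf.buildClassifier(train);

            // Evaluate only on the held-out split, never on the training data
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(rf, test);
            System.out.println(eval.toSummaryString());
        }
    }

Because the test instances were held out before training, the summary this prints reflects generalization rather than memorization.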
Upvotes: 1
Reputation: 3770
Easy: use a completely separate test set, i.e. one that contains no instances in common with the training set. Do not use cross-validation, or any other means of testing on your training data.
Note: by default, Weka's decision trees use pruning. That is, they attempt to generalize the tree (read: prevent overfitting) by using statistical techniques to prune it before true leaf nodes are reached, whenever there is no good statistical reason to add further decision nodes. The only way to really know whether a decision tree is overfitting your training data is to check it against an IID test set. If you are overfitting, you will get great results when doing cross-validation or otherwise testing on your training set, but terrible results when testing on separate IID test data.
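A minimal sketch of that comparison with the Weka Java API (the ARFF file names are placeholders; the test file must have the same header as the training file):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OverfitCheck {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff");
            Instances test = DataSource.read("test.arff"); // no instances in common with train
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // J48 is Weka's C4.5 decision tree; pruning is on by default
            J48 tree = new J48();
            tree.buildClassifier(train);

            // Optimistic estimate: evaluate on the training data itself
            Evaluation onTrain = new Evaluation(train);
            onTrain.evaluateModel(tree, train);

            // Honest estimate: evaluate on the separate IID test set
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(tree, test);

            System.out.printf("Accuracy on training set: %.2f%%%n", onTrain.pctCorrect());
            System.out.printf("Accuracy on test set:     %.2f%%%n", onTest.pctCorrect());
        }
    }

A large gap between the two accuracies, e.g. near-perfect on the training set but far worse on the test set, is the classic signature of overfitting.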
Upvotes: 0