user2431438

Reputation: 307

Weka - How do I check if there is overfitting in Weka?

In Weka, how do I check whether an induced tree overfits the training data?

EDIT:

Here are the results of my Random Forest classifier on a large training set and on a much smaller validation set (generated dynamically based on the class ratio of the large training set).

You said that if there is overfitting, performance on the test set (which I call the validation set) would drop sharply. In this case it doesn't seem to drop much.

Large training set (25000 records)

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances       24849               99.3563 %
Incorrectly Classified Instances       161                0.6437 %
Kappa statistic                          0.9886
Mean absolute error                      0.0344
Root mean squared error                  0.0887
Relative absolute error                 30.31   %
Root relative squared error             37.2327 %
Total Number of Instances            25010     

Validation set (IID?) (5000 records)

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        4951               99.02   %
Incorrectly Classified Instances        49                0.98   %
Kappa statistic                          0.9827
Mean absolute error                      0.0402
Root mean squared error                  0.0999
Relative absolute error                 35.269  %
Root relative squared error             41.8963 %
Total Number of Instances             5000     

Upvotes: 2

Views: 5273

Answers (2)

Hitesh

Reputation: 21

If I am not mistaken, the accuracy figures shown above come from evaluating your classifier on the complete training dataset, not from classifying any test data. To get a meaningful accuracy estimate, use a train/test split or supply an external test set; either will give you a better idea of how the classifier really performs.
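To make the train/test split idea concrete, here is a minimal sketch in plain Java (no Weka classes) that shuffles row indices and cuts them into two disjoint groups. The class name `HoldoutSplit`, the 80/20 fraction, and the seed are assumptions chosen for illustration; in the Weka Explorer the "Percentage split" and "Supplied test set" test options serve the same purpose.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; it is not part of Weka.
final class HoldoutSplit {

    // Returns two disjoint lists of row indices: get(0) = training, get(1) = test.
    static List<List<Integer>> split(int n, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Shuffle with a fixed seed so the split is reproducible.
        Collections.shuffle(indices, new Random(seed));

        int cut = (int) Math.round(n * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, cut)));   // training rows
        parts.add(new ArrayList<>(indices.subList(cut, n)));   // held-out test rows
        return parts;
    }

    public static void main(String[] args) {
        // Assumed sizes: 10 rows, 80% used for training.
        List<List<Integer>> parts = split(10, 0.8, 42L);
        System.out.println("train rows: " + parts.get(0));
        System.out.println("test rows:  " + parts.get(1));
    }
}
```

Note that this simple version does not preserve the class ratio; a stratified split would shuffle and cut within each class separately.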

Upvotes: 1

Wesley Baugh

Reputation: 3770

Easy. Use a completely separate test set. That is, use a test set which contains no instances in common with the training set. Do not use cross validation, or any other means of testing on your training data.

Note: by default, Weka's decision trees use pruning. That is, they attempt to generalize the tree (read: prevent overfitting) by using statistical techniques to prune the tree before true leaf nodes are reached when there is no statistically sound reason to add more decision nodes. The only way to really know whether a decision tree is overfitting your training data is to check it against an IID test set. If you are overfitting, you will get great results when doing cross-validation or otherwise testing on your training set, but terrible results when testing on separate IID test data.
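As a rough illustration of that check, the sketch below recomputes accuracy from the correct/incorrect counts posted in the question and reports the gap between the two sets. The class `OverfitCheck` and its method names are hypothetical, and the comparison only means something under the assumption that the second summary was really computed on instances the model never saw during training.

```java
// Hypothetical helper for illustration only; it is not part of Weka.
final class OverfitCheck {

    // Accuracy in percent from the counts in a Weka summary block.
    static double accuracyPct(long correct, long incorrect) {
        return 100.0 * correct / (correct + incorrect);
    }

    // Gap in percentage points; a large positive gap suggests overfitting.
    static double gapPct(double trainPct, double heldOutPct) {
        return trainPct - heldOutPct;
    }

    public static void main(String[] args) {
        // Counts copied from the two summaries in the question.
        double train = accuracyPct(24849, 161);
        double validation = accuracyPct(4951, 49);
        System.out.printf("train=%.4f%% validation=%.4f%% gap=%.4f points%n",
                train, validation, gapPct(train, validation));
    }
}
```

With the posted counts the gap comes out at roughly a third of a percentage point, which is far from the "terrible results" an overfit model would show on unseen data.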

Upvotes: 0
