Reputation: 28032
I understand the intuitive meaning of overfitting and underfitting. Now, given a particular machine learning model that is trained upon the training data, how can you tell if the training overfitted or underfitted the data? Is there a quantitative way to measure these factors?
Can we look at the error and say if it has overfit or underfit?
Upvotes: 4
Views: 2540
Reputation: 17871
The existing answers are not, strictly speaking, wrong, but they are incomplete. Yes, you do need a validation set, but the important point is that you should not simply look at the model's error on the validation set and try to minimize it. That leads to overfitting all the same, because you are then effectively fitting to the validation set. The right approach is not to minimize the error on your sets, but to make the error independent of which training and validation sets you use. If the error on the validation set is significantly different from the error on the training set (it doesn't matter whether it is worse or better), then the model is overfit. This should, of course, be done in a cross-validation fashion: train on one random subset, validate on another random subset, and repeat.
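To make the "error should not depend on the split" idea concrete, here is a minimal sketch using only NumPy. The synthetic data, the polynomial model, and the helper name `split_errors` are my own assumptions for illustration, not part of any library. It repeats random train/validation splits and reports the mean training error, the mean validation error, and the spread of the validation error across splits.

```python
import numpy as np

def split_errors(x, y, degree, n_repeats=50, val_fraction=0.3, seed=0):
    """Fit a polynomial on repeated random splits.

    Returns two arrays of length n_repeats: training MSE and validation MSE.
    (Hypothetical helper written for this example.)
    """
    rng = np.random.default_rng(seed)
    n_val = int(len(x) * val_fraction)
    train_mse, val_mse = [], []
    for _ in range(n_repeats):
        idx = rng.permutation(len(x))
        val_idx, train_idx = idx[:n_val], idx[n_val:]
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_res = np.polyval(coeffs, x[train_idx]) - y[train_idx]
        val_res = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        train_mse.append(np.mean(train_res ** 2))
        val_mse.append(np.mean(val_res ** 2))
    return np.array(train_mse), np.array(val_mse)

# Synthetic data (an assumption for the demo): noisy sine curve.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + data_rng.normal(0, 0.2, 40)

for degree in (1, 3, 9):
    tr, va = split_errors(x, y, degree)
    print(f"degree={degree}: train={tr.mean():.3f}  "
          f"val={va.mean():.3f}  val_std={va.std():.3f}")
```

A large gap between the two means, or a validation error that swings a lot from split to split, is the kind of dependence on the split described above. What counts as "significantly different" is still a judgment call for your problem.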
Upvotes: 0
Reputation: 77485
You don't look at the error on the training data, but on the validation data only.
A common way of testing is to try different model complexities and see how the validation error changes with model complexity. Usually this gives a typical curve. In the beginning, the error improves quickly. Then there is saturation (where the model is good). Then the error starts increasing again, not because the data changed, but because the model is overfitting. You want to be on the low-complexity end of the plateau: the simplest model that provides reasonable generalization.
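A rough sketch of that sweep, assuming polynomial degree as the complexity knob and a single held-out validation set. The helper names and the 10% tolerance are assumptions I picked for the example. It computes the validation error per degree and then picks the simplest degree whose error is within the tolerance of the best one, which is one way to read "the low-complexity end of the plateau".

```python
import numpy as np

def validation_curve(x_train, y_train, x_val, y_val, degrees):
    """Validation MSE for each polynomial degree (hypothetical helper)."""
    errors = []
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, d)
        residuals = np.polyval(coeffs, x_val) - y_val
        errors.append(float(np.mean(residuals ** 2)))
    return errors

def simplest_within_tolerance(complexities, errors, rel_tol=0.10):
    """Return the first (lowest) complexity whose error is within
    rel_tol of the minimum error. Assumes complexities are sorted
    from simplest to most complex."""
    best = min(errors)
    for c, e in zip(complexities, errors):
        if e <= best * (1 + rel_tol):
            return c

# Synthetic data (an assumption for the demo): noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

degrees = list(range(1, 11))
val_errors = validation_curve(x_train, y_train, x_val, y_val, degrees)
for d, e in zip(degrees, val_errors):
    print(f"degree={d:2d}  val_mse={e:.4f}")
print("chosen degree:", simplest_within_tolerance(degrees, val_errors))
```

The tolerance encodes how much validation error you are willing to give up for a simpler model; there is no universally right value.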
Upvotes: 1
Reputation: 3744
The usual way, I think, is known as cross-validation. The idea is to split the training set into several pieces, known as folds, then pick one at a time for evaluation and train on the remaining ones.
It does not, of course, measure the actual overfitting or underfitting, but if you can vary the complexity of the model, e.g. by changing the regularization term, you can find the optimal point. This is as far as one can go with just training and testing, I think.
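Here is a small sketch of k-fold cross-validation used to tune a regularization term, written with plain NumPy. Ridge regression on polynomial features is just the model I chose for illustration; `ridge_fit`, `kfold_mse`, and the grid of `lam` values are assumptions, not anything prescribed above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam*I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def kfold_mse(X, y, lam, k=5, seed=0):
    """Mean validation MSE over k folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Synthetic data (an assumption for the demo): noisy sine, degree-9 features.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.2, 30)
X = np.vander(x, 10, increasing=True)  # columns x^0 ... x^9

lams = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
cv_errors = [kfold_mse(X, y, lam) for lam in lams]
for lam, err in zip(lams, cv_errors):
    print(f"lam={lam:g}  cv_mse={err:.4f}")
best_lam = lams[int(np.argmin(cv_errors))]
print("best lam:", best_lam)
```

Very small `lam` lets the model chase noise and very large `lam` flattens it, so the cross-validated error is typically lowest somewhere in between. The code only finds that point on the grid you give it.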
Upvotes: 4
Reputation: 2264
I believe the easiest approach is to have two sets of data: training data and validation data. You keep training the model on the training data as long as the fitness of the model on the training data stays close to its fitness on the validation data. When the model's fitness keeps increasing on the training data but not on the validation data, you're overfitting.
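One way to turn this into code is what is usually called early stopping. The sketch below is my own minimal version, assuming a linear model on polynomial features trained by plain gradient descent; the function name, learning rate, and `patience` value are all assumptions. It tracks training and validation MSE each epoch, remembers the weights with the best validation error, and stops once validation has not improved for `patience` epochs.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.05, max_epochs=5000, patience=50):
    """Gradient descent on MSE with early stopping (hypothetical helper).

    Returns (best_weights, best_epoch, history), where history is a list
    of (train_mse, val_mse) tuples, one per epoch actually run.
    """
    w = np.zeros(X_tr.shape[1])
    best_val, best_w, best_epoch = np.inf, w.copy(), 0
    history = []
    for epoch in range(max_epochs):
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        train_mse = float(np.mean((X_tr @ w - y_tr) ** 2))
        val_mse = float(np.mean((X_val @ w - y_val) ** 2))
        history.append((train_mse, val_mse))
        if val_mse < best_val:
            # Validation improved: remember this model.
            best_val, best_w, best_epoch = val_mse, w.copy(), epoch
        elif epoch - best_epoch >= patience:
            # Training may still improve, but validation has stalled.
            break
    return best_w, best_epoch, history

# Synthetic data (an assumption for the demo): noisy sine, degree-9 features.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
X = np.vander(x, 10, increasing=True)  # columns x^0 ... x^9
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

best_w, best_epoch, history = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
print("epochs run:", len(history), " best epoch:", best_epoch)
print("train/val MSE at best epoch:", history[best_epoch])
```

Whether the stop condition ever triggers depends on the data and the model. If validation error keeps improving until `max_epochs`, the loop simply runs to the end, which suggests the model has not started to overfit yet.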
Upvotes: 8