Reputation: 31
What will happen if I use the same training data and validation data for my machine learning classifier?
Upvotes: 2
Views: 408
Reputation: 19312
If you use the same dataset for training and validation, then:

training_accuracy = testing_accuracy

Your testing_accuracy will be the same as your training_accuracy if you use the training dataset as the validation dataset. Therefore you will NOT be able to tell whether your model has overfit, i.e., whether it generalizes to unseen data.
Let's talk about datasets and evaluation metrics.

With the training_accuracy, you can get a sense of how well a model fits your data, and the testing_accuracy tells you how well that model generalizes. If training_accuracy is low, then your model has underfit, and you may need a better model (better features, a different architecture, etc.) for the given problem. If training_accuracy is high but testing_accuracy is low, your model fits the training data well but is not generalizable to unseen data. This is overfitting.
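A minimal sketch of this diagnostic, assuming scikit-learn is available; the breast-cancer dataset and decision-tree classifier are illustrative stand-ins, not anything prescribed by the answer:

```python
# Minimal sketch (assumes scikit-learn); dataset and classifier are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42)  # deep tree: prone to overfitting
model.fit(X_train, y_train)

training_accuracy = model.score(X_train, y_train)
testing_accuracy = model.score(X_test, y_test)

# Low training_accuracy                          -> underfitting: try a better model.
# High training_accuracy, low testing_accuracy  -> overfitting: poor generalization.
print(f"training_accuracy = {training_accuracy:.3f}")
print(f"testing_accuracy  = {testing_accuracy:.3f}")
```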
Note: In practice, it is better to have an overfit model and regularize it heavily than to work with an underfit model.
Another important thing to understand is that training a model (fit) and running inference with a model (predict / score) are two separate tasks. Therefore, when you use the training dataset as the validation dataset, you are still training the model on the same training dataset, and during inference you are scoring it on that same training dataset, which will give you the same accuracy as the training_accuracy.
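A small self-contained sketch of that point, assuming scikit-learn (whose fit / predict / score API the answer refers to); the iris dataset and logistic-regression model are just illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000).fit(X, y)  # training (fit)

training_accuracy = model.score(X, y)    # inference on the training data
validation_accuracy = model.score(X, y)  # "validation" on the same data

# The two numbers are identical by construction: score() just runs predict()
# on whatever data you pass it, and here that is the training data both times.
assert training_accuracy == validation_accuracy
```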
You will therefore never find out whether you have overfit. BUT that doesn't mean you will get 99% accuracy, as another answer suggests! You may still underfit and get an extremely low accuracy.
Upvotes: 0
Reputation: 2876
If the training data and the validation data are the same, the trained classifier will have a high accuracy, because it has already seen the data. That is why we use train-test splits. We take 60-70% of the data to train the classifier, and then run the classifier against the remaining 30-40%, the validation data, which the classifier has not seen yet. This helps us measure the accuracy of the classifier and detect behavior such as overfitting or underfitting before applying it to a real test set with no labels.
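A hedged sketch of such a split using scikit-learn's train_test_split, assuming that library; the random placeholder data is purely illustrative:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(1000, 5)        # placeholder features
y = np.random.randint(0, 2, 1000)  # placeholder binary labels

# Hold out 30% of the data that the classifier will never see during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_val.shape)  # (700, 5) (300, 5)
```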
Upvotes: 2
Reputation: 313
Generally, we divide the data into training and validation sets to prevent overfitting. To explain, consider a model that classifies whether an image contains a human, and a dataset of 1000 human images. If you train your model on all the images in that dataset and then validate it on that same dataset, your accuracy will be around 99%. However, when you give your model an image from a different dataset to classify, your accuracy will be much lower. Generalization, in this example, means training the model to look for a stick figure in order to decide whether the image shows a human, instead of looking for one specific handsome blonde man. That is why we divide the dataset into training and validation sets: to generalize the model and prevent overfitting.
Upvotes: 0
Reputation: 82
Basically nothing happens. You are just validating your model's performance on the same data it was trained on, which yields nothing different or useful in practice. It is like teaching someone to recognize an apple and then asking them to recognize that very same apple to see how well they learned.
Why is a validation set used, then? In short, the training and validation sets are assumed to be drawn from the same distribution, so a model trained on the training set should perform almost equally well on the examples from the validation set that it has not seen before.
Upvotes: 0
Reputation: 310
We create multiple models and then use the validation data to see which model performs best. We also use the validation data to tune the complexity of our model to the correct level. If you use the training data as your validation data, you will achieve incredibly high apparent success (your misclassification rate or average squared error will be tiny), but when you apply the model to real data that isn't from your training data, it will do very poorly. This is called OVERFITTING to the training data.
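One way this model selection might look in practice, sketched with scikit-learn (an assumption; the answer names no library), comparing decision trees of increasing complexity on a held-out validation set; the depths tried are illustrative, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

best_depth, best_val_acc = None, 0.0
for depth in (1, 2, 4, 8, None):               # None = unlimited depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)        # judged on unseen data
    train_acc = model.score(X_train, y_train)  # would reward overfitting
    print(f"depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}")
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

print("best depth by validation accuracy:", best_depth)
```

Picking the winner by train_acc instead of val_acc would always favor the deepest tree, which is exactly the overfitting trap this answer describes.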
Upvotes: 1