Reputation: 147
I am studying the Machine Learning course by Andrew Ng, and in it he says that a larger number of features combined with a smaller amount of data can lead to overfitting. Can someone elaborate on this?
Upvotes: 0
Views: 1118
Reputation: 1259
In general, the less data you have, the more easily your model can memorize the exceptions in your training set. This leads to high accuracy on the training set but low accuracy on the test set, because the model carries over quirks it has memorized from the small training set to data where they do not hold.
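A minimal sketch of that effect (assuming scikit-learn is available; the dataset sizes and feature counts are made up for illustration): a flexible model trained on a handful of samples with many noise features scores perfectly on its training data but poorly on held-out data, and the gap shrinks as the training set grows.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# 50 features, only 5 of which actually carry signal; the rest are noise.
X, y = make_classification(n_samples=10_000, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

for n_train in (30, 300, 3000):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=n_train, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"n_train={n_train:5d}  "
          f"train acc={model.score(X_train, y_train):.2f}  "
          f"test acc={model.score(X_test, y_test):.2f}")

# Typical pattern: training accuracy stays near 1.00 (the tree memorizes the
# small set), while test accuracy only improves as the training set grows.
```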
For example, consider a Bayesian classifier. We want to predict the math grades of students based on features such as their grade from last semester and their height.
As we know, the last feature (height) is probably irrelevant. Provided we have enough data, our model will learn that this feature is irrelevant, because students of all heights will get all kinds of grades if the dataset is big enough.
Now consider a very small dataset (e.g. only one class of students). In this case it is very unlikely that the students' grades are completely uncorrelated with their heights (e.g., by chance the tall students will score somewhat above or below average), so our model will be able to make use of that feature. The problem is that the model has learned a correlation between grade and height that does not exist outside the training dataset.
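Here is a hypothetical simulation of that point (plain NumPy, with made-up grade and height distributions): even though height is generated completely independently of the grade, a sample of only ten students can show a sizeable correlation by pure chance, while a large sample does not.

```python
import numpy as np

rng = np.random.default_rng(42)

def grade_height_corr(n_students):
    grades = rng.normal(70, 10, size=n_students)   # grades, independent of height
    heights = rng.normal(170, 8, size=n_students)  # heights, a pure noise feature
    return np.corrcoef(grades, heights)[0, 1]

for n in (10, 100, 10_000):
    print(f"n={n:6d}  corr(grade, height) = {grade_height_corr(n):+.2f}")

# With n=10 the spurious correlation can easily reach +/-0.3 or more, so a model
# trained on one small class will happily use height; with n=10_000 it shrinks
# toward 0 and the model learns that height is irrelevant.
```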
It could also go the other way: our model might learn that everyone who got a good grade last semester will get a good grade this semester (since that might hold in a small dataset) and not use the other features at all.
A more general reason, as I mentioned earlier, is that the model can memorize the dataset. There are always outlier samples that cannot be classified easily. When the dataset is small, the model can find a way to fit these outliers since there are only a few of them. However, it will not be able to handle the real outliers in the test set.
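To illustrate the memorization point, here is a small sketch (again scikit-learn; a few labels are flipped by hand to play the role of outliers): an unpruned decision tree fits those outliers perfectly on the small training set, which does not help, and usually hurts, on unseen data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=60, random_state=0)

# Flip 5 training labels so they act as outliers the model could memorize.
y_noisy = y_tr.copy()
flip = np.random.default_rng(0).choice(len(y_noisy), size=5, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)
print("train acc:", tree.score(X_tr, y_noisy))  # typically 1.0: the outliers are memorized
print("test  acc:", tree.score(X_te, y_te))     # noticeably lower on unseen data
```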
Upvotes: 3