Reputation: 259
To my understanding, high variance means the model itself has the problem of over-fitting. But in Andrew Ng's video lecture, he mentioned that more training data can reduce high variance. What is the detailed reason?
Upvotes: 1
Views: 1894
Reputation: 1
1. A larger training set increases the SNR (signal-to-noise ratio).
2. A higher SNR means the noise has less influence relative to the signal.
3. With less influence from noise, the variance of the model decreases.
Note that the variance comes from noise: noise-free data does not cause variance in the model.
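One way to see this numerically is a small simulation. It is only a sketch under assumed settings: a straight-line signal with slope 2, Gaussian noise with standard deviation 1, and a hypothetical helper named `slope_variance`. It fits the same model on many independently drawn noisy training sets and measures how much the fitted slope changes from one training set to the next.

```python
import numpy as np

def slope_variance(n_samples, n_trials=500, noise_std=1.0, seed=0):
    """Variance of the least-squares slope across independently drawn training sets."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=n_samples)
        # Assumed ground truth for illustration: y = 2x plus Gaussian noise.
        y = 2.0 * x + rng.normal(0.0, noise_std, size=n_samples)
        # np.polyfit returns the highest-degree coefficient first, i.e. the slope.
        slopes[t] = np.polyfit(x, y, deg=1)[0]
    return slopes.var()

small = slope_variance(10)
large = slope_variance(1000)
print(f"slope variance with 10 samples:   {small:.4f}")
print(f"slope variance with 1000 samples: {large:.4f}")
```

The noise level is the same in both runs; only the number of samples changes. With more samples the noise averages out, so the fitted slope should vary much less between training sets.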
Upvotes: 0
Reputation: 5208
Basically, a model will overfit if it has too much variance relative to the training set size.
If you have, say, 5 degrees of freedom, you can perfectly match (fit) 5 samples. But you can't perfectly match 1000 samples.
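A minimal sketch of this point, assuming a degree-4 polynomial as the model with 5 degrees of freedom (5 coefficients) and targets that are pure noise. The data and the helper name `max_train_residual` are made up for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def max_train_residual(n_samples, degree=4, seed=0):
    """Largest absolute training error of a polynomial fit to pure-noise targets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_samples)
    y = rng.normal(size=n_samples)  # pure noise: there is nothing real to learn
    model = Polynomial.fit(x, y, deg=degree)  # degree 4 -> 5 coefficients
    return np.max(np.abs(model(x) - y))

res_5 = max_train_residual(5)
res_1000 = max_train_residual(1000)
print(f"max training residual, 5 samples:    {res_5:.2e}")
print(f"max training residual, 1000 samples: {res_1000:.2e}")
```

With 5 samples the polynomial passes through every point, so it memorizes the noise. With 1000 samples the same 5 coefficients cannot follow every point, so the fit is forced to ignore most of the noise.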
So by adding more data samples (and thus, hopefully, increasing the variety in your dataset), you can prevent overfitting.
Unfortunately, it's hard to get more data. It's easier to reduce the degrees of freedom.
Upvotes: 3