Reputation: 53
I trained two SVMs (LIBSVM) with 15451 samples after doing a 10-fold cross-validation to find the best parameter values for gamma and C (RBF kernel). In one SVM I used just one feature, and in the second an additional one (to see whether this extra feature improves prediction). After CV I have an accuracy of 75 % (SVM with one feature) and 77 % (SVM with the additional one). After testing on another 15451 instances I get an accuracy of 70 % and 72 % respectively.
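Roughly, the procedure looks like this (a sketch using scikit-learn's SVC rather than the LIBSVM tools I actually used, with placeholder data, so the exact calls and values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for my real feature matrix and labels
# (the real training and test sets each have 15451 instances).
rng = np.random.RandomState(0)
X_train = rng.randn(2000, 2)
y_train = (X_train[:, 0] + 0.5 * rng.randn(2000) > 0).astype(int)
X_test = rng.randn(2000, 2)
y_test = (X_test[:, 0] + 0.5 * rng.randn(2000) > 0).astype(int)

# 10-fold CV grid search over C and gamma for an RBF-kernel SVM
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("CV accuracy:", grid.best_score_)              # ~0.75 / 0.77 in my case
print("test accuracy:", grid.score(X_test, y_test))  # ~0.70 / 0.72 in my case
```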
I know that this is called overfitting, but is it significant here, since it is only a difference of 5 %?
What could I do to avoid overfitting?
Is it even good to use just one or two features and a relatively big training set?
Hope you can help me out.
Upvotes: 5
Views: 1290
Reputation: 950
There seems to be some confusion about overfitting here.
In short, "overfitting" does NOT mean that your accuracy on fitting the training set is (disproportionately) higher than fitting a generic test set. Rather, this is the effect and not the cause.
"Overfitting" means that your model is trying too hard to fit the training set at any cost, and after picking up all the signal there is it is starting to fit noise. As a (very standard) example, imagine to generate data points coming from a straight line, but then add a little Gaussian noise: the points will be "roughly" on a line, but not exactly. You are overfitting when you try to find a curve that will go through each and every point (say for example a polynomial of grade 27) when all you really needed was a straight line.
One way to check this visually is to draw a learning curve.
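For instance, assuming you can load your data in Python, scikit-learn's learning_curve utility gives you the training and cross-validation scores to plot (a sketch with placeholder data, not your exact setup):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

# Placeholder data standing in for your (features, labels)
rng = np.random.RandomState(0)
X = rng.randn(2000, 2)
y = (X[:, 0] + 0.5 * rng.randn(2000) > 0).astype(int)

# Training vs cross-validation accuracy as the training set grows
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf", C=1, gamma=0.1), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=10)

plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="cross-validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

A large, persistent gap between the two curves is the visual signature of overfitting; curves that converge suggest you have picked up most of the available signal.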
This webpage looks informative, so I would start here to learn more: http://www.astroml.org/sklearn_tutorial/practical.html
Upvotes: 4