Joy

Reputation: 19

SVM overfitting in scikit-learn

I am building a digit-recognition classifier using an SVM. I have 10,000 samples, which I split into training and test sets at a 7:3 ratio. I use a linear kernel.

It turns out that training accuracy is always 1 no matter how many training examples I use, while test accuracy is only around 0.9 (I am expecting much better accuracy, at least 0.95). I think this indicates overfitting. However, I have tried tuning parameters such as C and gamma, and they don't change the results very much (see the grid-search sketch after the code below).

How do I deal with overfitting in an SVM?

The following is my code:

from sklearn import svm, cross_validation

# Fit a linear-kernel SVM (gamma has no effect with kernel='linear')
clf = svm.SVC(kernel='linear', C=10000, verbose=True).fit(sample_X, sample_y_1Num)

# Accuracy on the training and test sets
predict_y_train = clf.predict(sample_X)
predict_y_test = clf.predict(test_X)
accuracy_train = clf.score(sample_X, sample_y_1Num)
accuracy_test = clf.score(test_X, test_y_1Num)

# Conduct cross-validation on the training set
cv = cross_validation.ShuffleSplit(sample_y_1Num.size, n_iter=10, test_size=0.2, random_state=None)
scores = cross_validation.cross_val_score(clf, sample_X, sample_y_1Num, cv=cv)
score_mean = scores.mean()  # bare mean() is undefined; use the array method
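
For reference, here is a minimal sketch of how I searched over C systematically (the candidate values are illustrative, not tested settings, and gamma has no effect with a linear kernel):

from sklearn import svm, grid_search

# Illustrative grid search over C with 5-fold cross-validation;
# the grid values below are assumptions, not recommendations
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000]}
search = grid_search.GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=5)
search.fit(sample_X, sample_y_1Num)
print(search.best_params_, search.best_score_)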

Upvotes: 0

Views: 6247

Answers (1)

Steve

Reputation: 1292

One way to reduce overfitting is to add more training observations. Since your problem is digit recognition, it is easy to synthetically generate more training data by slightly perturbing the observations in your original data set. You can generate four new observations from each existing one by shifting the digit image one pixel left, right, up, and down. This multiplies the size of your training set by five and should help the classifier learn to generalize instead of memorizing the noise.
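
As an example, here is a minimal sketch of that augmentation. It assumes each row of sample_X is a flattened square image with a zero background; img_side=28 is an assumption, so set it to your actual image width:

import numpy as np
from scipy.ndimage import shift

# Shift each image one pixel left, right, up, and down, padding with zeros.
# img_side=28 is an assumption; use the side length of your own images.
def augment_with_shifts(X, y, img_side=28):
    Xs, ys = [X], [y]
    for dy, dx in [(0, -1), (0, 1), (-1, 0), (1, 0)]:
        moved = np.array([shift(img.reshape(img_side, img_side), (dy, dx), cval=0).ravel()
                          for img in X])
        Xs.append(moved)
        ys.append(y)
    return np.concatenate(Xs), np.concatenate(ys)

# Five times the original training set: the originals plus four shifted copies
aug_X, aug_y = augment_with_shifts(sample_X, sample_y_1Num)

Training on aug_X and aug_y instead of the original arrays is what should push the classifier toward translation invariance.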

Upvotes: 3
