Reputation: 649
I am using scikit-learn to carry out Sentiment Analysis of text. My features right now are just word frequency counts.
When I do the following, the averaged F-measure is around 59%:
from sklearn import svm
clf = svm.LinearSVC(class_weight='auto')
clf.fit(Xfeatures, YLabels)
......
predictedLabels = clf.predict(XTestFeatures)
But when I use StandardScaler() to scale my feature vector, the averaged F-measure drops to 49%:
from sklearn import svm
from sklearn.preprocessing import StandardScaler

clf = svm.LinearSVC(class_weight='auto')
scaler = StandardScaler()
Xfeatures = scaler.fit_transform(Xfeatures)
clf.fit(Xfeatures, YLabels)
......
XTestFeatures = scaler.transform(XTestFeatures)
predictedLabels = clf.predict(XTestFeatures)
Scaling is supposed to improve the performance of my SVM, but here it seems to hurt it. Why does this happen, and how can I fix it?
Upvotes: 3
Views: 3082
Reputation: 363567
Scaling by mean and variance isn't a good strategy for term frequencies. Suppose you have two term histograms over three terms (call them 0, 1 and 2):
>>> import numpy as np
>>> X = np.array([[100, 10, 50], [1, 0, 2]], dtype=np.float64)
and you scale them; then you get
>>> from sklearn.preprocessing import scale
>>> scale(X)
array([[ 1.,  1.,  1.],
       [-1., -1., -1.]])
The scaling just made it impossible to tell that term 2 occurred more often in X[1] than term 0 did. In fact, it is no longer even visible that term 1 did not occur in X[1] at all.
Of course, this is a very extreme example, but similar effects occur in larger sets. What you should do instead is normalize the histograms:
>>> from sklearn.preprocessing import normalize
>>> normalize(X)
array([[ 0.89087081,  0.08908708,  0.4454354 ],
       [ 0.4472136 ,  0.        ,  0.89442719]])
This preserves the relative frequencies of the terms, which is what you're interested in: having more positive terms than negative ones is what a linear sentiment classifier cares about, not the actual frequencies or a scaled variant of them.
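Applied to your snippet, the fix is just to swap the StandardScaler for a Normalizer. A minimal sketch, reusing Xfeatures, YLabels and XTestFeatures from your question:
from sklearn import svm
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()  # L2-normalizes each row (each term histogram)
Xfeatures = normalizer.fit_transform(Xfeatures)
XTestFeatures = normalizer.transform(XTestFeatures)

# class_weight='auto' as in your code ('balanced' in newer scikit-learn)
clf = svm.LinearSVC(class_weight='auto')
clf.fit(Xfeatures, YLabels)
predictedLabels = clf.predict(XTestFeatures)
Normalizer is stateless (fit learns nothing), so you could equally call normalize() on each matrix directly; the transformer form just drops neatly into a Pipeline.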
(Scaling is recommended for domains where the scales of individual features don't carry meaning in themselves, typically because the features are measured in different units.)
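A toy illustration with made-up numbers: height in metres and weight in kilograms live on incomparable scales, and standardizing puts them on an equal footing:
>>> import numpy as np
>>> from sklearn.preprocessing import scale
>>> scale(np.array([[1.80, 95.0], [1.60, 55.0]]))  # [height_m, weight_kg]
array([[ 1.,  1.],
       [-1., -1.]])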
Upvotes: 5
Reputation: 66805
There are at least a few things to consider:
You are training with the default C=1, which can give nearly random results. You have to fit the best hyperparameters by some optimization technique (at least a grid search) in order to fairly compare two different data preprocessings (for example, your scaling); see the sketch below.
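A minimal sketch of such a grid search over C, assuming Xfeatures, YLabels and XTestFeatures from the question; the C grid and the macro-averaged F1 scoring are illustrative choices, and note that in current scikit-learn GridSearchCV lives in sklearn.model_selection and class_weight='auto' has been renamed 'balanced':
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Candidate values of C; an illustrative grid, widen it as needed.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(svm.LinearSVC(class_weight='balanced'),
                      param_grid, scoring='f1_macro', cv=5)
search.fit(Xfeatures, YLabels)
print(search.best_params_)  # the C chosen by cross-validation
predictedLabels = search.predict(XTestFeatures)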
Upvotes: 3