Reputation: 649
I am using scikit-learn to carry out Sentiment Analysis of text. My features right now are just word frequency counts.
When I do the following, the averaged F-measure is around 59%:
from sklearn import svm
clf = svm.LinearSVC(class_weight='auto')
clf.fit(Xfeatures, YLabels)
......
predictedLabels = clf.predict(XTestFeatures)
But when I use StandardScaler() to scale my feature vector, the averaged F-measure drops to 49%:
from sklearn import svm
from sklearn.preprocessing import StandardScaler

clf = svm.LinearSVC(class_weight='auto')
scaler = StandardScaler()
Xfeatures = scaler.fit_transform(Xfeatures)
clf.fit(Xfeatures, YLabels)
......
XTestFeatures = scaler.transform(XTestFeatures)
predictedLabels = clf.predict(XTestFeatures)
Scaling is supposed to improve the performance of my SVM, but here it seems to hurt it. Why does this happen, and how can I fix it?
Upvotes: 3
Views: 3082
Reputation: 363567
Scaling by mean and variance isn't a good strategy for term frequencies. Suppose you have two term histograms over three terms (call them 0, 1 and 2):
>>> import numpy as np
>>> X = np.array([[100, 10, 50], [1, 0, 2]], dtype=np.float64)
and you scale them; then you get
>>> from sklearn.preprocessing import scale
>>> scale(X)
array([[ 1.,  1.,  1.],
       [-1., -1., -1.]])
The scaling just made it impossible to tell that term 2 occurred more often in X[1] than term 0 did. In fact, it is no longer even visible that term 1 did not occur in X[1] at all.
Of course, this is a very extreme example, but similar effects occur in larger sets. What you should do instead is normalize the histograms:
>>> from sklearn.preprocessing import normalize
>>> normalize(X)
array([[ 0.89087081,  0.08908708,  0.4454354 ],
       [ 0.4472136 ,  0.        ,  0.89442719]])
This preserves the relative frequencies of the terms, which is what you're interested in: having more positive terms than negative ones is what a linear sentiment classifier cares about, not the actual frequencies or a scaled variant of them.
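Applied to your snippet, the fix is just to swap the StandardScaler for a Normalizer. A minimal sketch, reusing Xfeatures, YLabels and XTestFeatures from your question:
from sklearn import svm
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()  # L2-normalizes each row (each term histogram)
Xfeatures = normalizer.fit_transform(Xfeatures)
XTestFeatures = normalizer.transform(XTestFeatures)

# class_weight='auto' as in your code ('balanced' in newer scikit-learn)
clf = svm.LinearSVC(class_weight='auto')
clf.fit(Xfeatures, YLabels)
predictedLabels = clf.predict(XTestFeatures)
Normalizer is stateless (fit learns nothing), so you could equally call normalize() on each matrix directly; the transformer form just drops neatly into a Pipeline.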
(Scaling is recommended for domains where the scales of individual features don't carry meaning in themselves, typically because the features are measured in different units.)
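A toy illustration with made-up numbers: height in metres and weight in kilograms live on incomparable scales, and standardizing puts them on an equal footing:
>>> import numpy as np
>>> from sklearn.preprocessing import scale
>>> scale(np.array([[1.80, 95.0], [1.60, 55.0]]))  # [height_m, weight_kg]
array([[ 1.,  1.],
       [-1., -1.]])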
Upvotes: 5
Reputation: 66805
There are at least a few things to consider:
You are training with the default C=1, which can give nearly random results. You have to fit the best hyperparameters by some optimization technique (at least a grid search) in order to fairly compare two different data preprocessings (for example, your scaling); see the sketch below.
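A minimal sketch of such a grid search over C, assuming Xfeatures, YLabels and XTestFeatures from the question; the C grid and the macro-averaged F1 scoring are illustrative choices, and note that in current scikit-learn GridSearchCV lives in sklearn.model_selection and class_weight='auto' has been renamed 'balanced':
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Candidate values of C; an illustrative grid, widen it as needed.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(svm.LinearSVC(class_weight='balanced'),
                      param_grid, scoring='f1_macro', cv=5)
search.fit(Xfeatures, YLabels)
print(search.best_params_)  # the C chosen by cross-validation
predictedLabels = search.predict(XTestFeatures)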
Upvotes: 3