user570593

Reputation: 3520

svm scaling input values

I am using libSVM. Say my feature values are in the following format:

                         instance1 : f11, f12, f13, f14
                         instance2 : f21, f22, f23, f24
                         instance3 : f31, f32, f33, f34
                         instance4 : f41, f42, f43, f44
                         ..............................
                         instanceN : fN1, fN2, fN3, fN4

I think there are two kinds of scaling that can be applied:

  1. Scale each instance vector so that it has zero mean and unit variance:

        ( (f11, f12, f13, f14) - mean((f11, f12, f13, f14)) ) ./ std((f11, f12, f13, f14))
    
  2. Scale each column of the above matrix to a range, for example [-1, 1] (both options are sketched below).
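
For concreteness, the two options could be written roughly like this, assuming the data sits in an N x 4 NumPy array X (the array and its values are just placeholders, not my real features):

    import numpy as np

    # Placeholder N x 4 matrix: rows are instances, columns are features.
    X = np.array([[1.0, 200.0, 0.5, 30.0],
                  [2.0, 180.0, 0.7, 25.0],
                  [3.0, 220.0, 0.6, 40.0],
                  [4.0, 210.0, 0.4, 35.0]])

    # (1) Scale each instance (row) to zero mean and unit variance.
    X_rows = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # (2) Scale each column linearly to the range [-1, 1].
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    X_cols = 2.0 * (X - col_min) / (col_max - col_min) - 1.0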

In my experiments with the RBF kernel (libSVM), I found that the second scaling (2) improves the results by about 10%, but I do not understand why (2) gives improved results.

Could anybody explain to me what the reason for applying scaling is, and why the second option gives me improved results?

Upvotes: 16

Views: 21191

Answers (3)

Maciej Skorski

Reputation: 3354

The accepted answer speaks of "Standard Scaling", which is not efficient for high-dimensional data stored in sparse matrices (text data is a typical use case): subtracting the per-column mean destroys sparsity. In such cases you may resort to "Max Scaling" and its variants, which work with sparse matrices.
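
For example, a minimal sketch with scikit-learn's MaxAbsScaler, which accepts SciPy sparse input (the tiny matrix is just an illustration):

    import scipy.sparse as sp
    from sklearn.preprocessing import MaxAbsScaler

    # Toy sparse matrix, e.g. bag-of-words counts.
    X = sp.csr_matrix([[0.0, 4.0, 0.0],
                       [2.0, 0.0, 0.0],
                       [0.0, 1.0, 3.0]])

    # Max scaling divides each column by its maximum absolute value, mapping
    # values into [-1, 1] while keeping zeros at zero, so the matrix stays
    # sparse (unlike mean subtraction, which would densify it).
    X_scaled = MaxAbsScaler().fit_transform(X)
    print(X_scaled.toarray())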

Upvotes: 0

Thanos

Reputation: 2572

I believe that it largely comes down to your original data.

If your original data has SOME extreme values in some columns, then in my opinion you lose some definition when scaling linearly, for example to the range [-1, 1].

Let's say you have a column where 90% of the values are between 100 and 500, while in the remaining 10% the values go as low as -2000 and as high as +2500.

If you scale this data linearly, then you'll have:

-2000 -> -1 ## <- The min in your scaled data
+2500 -> +1 ## <- The max in your scaled data

 100 -> -0.06666666666666665 
 234 -> -0.007111111111111068
 500 ->  0.11111111111111116
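
(For reference, that mapping is plain min-max scaling of [-2000, 2500] onto [-1, 1]; a quick sketch that reproduces the numbers above:)

    # Linear (min-max) scaling of the column range [-2000, 2500] onto [-1, 1].
    lo, hi = -2000.0, 2500.0

    def scale(x):
        return 2.0 * (x - lo) / (hi - lo) - 1.0

    for x in (-2000, 2500, 100, 234, 500):
        print(x, "->", scale(x))
    # 100 -> -0.0666..., 234 -> -0.00711..., 500 -> 0.1111...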

You could argue that the distinction between what was originally 100 and what was 500 is much smaller in the scaled data than it was in the original data.

In the end, I believe it very much comes down to the specifics of your data. I also believe the 10% performance improvement is largely coincidental; you will certainly not see a difference of this magnitude on every dataset where you try both scaling methods.

At the same time, in the paper linked in the other answer, you can clearly see that the authors recommend scaling the data linearly.

I hope someone finds this useful!

Upvotes: 4

user334856

Reputation:

The standard thing to do is to make each dimension (or attribute, or column, in your example) have zero mean and unit variance.

This brings every dimension of the SVM input to a comparable magnitude. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:

The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1].
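
As a minimal sketch (assuming scikit-learn for convenience; the guide only describes the scaling itself, not any particular library), per-attribute standardization and [-1, +1] scaling look like:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Toy matrix: rows are instances, columns are attributes.
    X_train = np.array([[1.0, 200.0, 0.5],
                        [2.0, 180.0, 0.7],
                        [3.0, 220.0, 0.6]])

    # Zero mean, unit variance per column (attribute).
    X_std = StandardScaler().fit_transform(X_train)

    # Linear scaling of each column to [-1, +1], as the guide recommends.
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    X_minmax = scaler.transform(X_train)

    # The same fitted scaler must later be applied to the test data so that
    # training and test sets use identical scaling factors.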

Upvotes: 21
