Reputation: 2439
I am using StandardScaler from sklearn to scale my feature vector, but it doesn't seem to fit the training feature vector properly. Or maybe this is the expected behavior; if it is, could someone explain why (preferably with some mathematical explanation too)?
from sklearn.preprocessing import StandardScaler
import numpy as np
scale_inst = StandardScaler()
# train feature vector
x1 = np.array([1, 2, 10, 44, 55])
# test feature vector
x2 = np.array([1, 2, 10, 44, 667])
# first I fit
scale_inst.fit(x1)
# then I transform both the training vector and the test vector
print(scale_inst.transform(x1))
print(scale_inst.transform(x2))
# OUTPUT
[-0.94627295 -0.90205459 -0.54830769 0.95511663 1.44151861]
[ -0.94627295 -0.90205459 -0.54830769 0.95511663 28.50315638]
Why does it scale 667 to 28.50315638? Shouldn't it be capped at 1.44151861, i.e. the scaled maximum of the training feature vector?
Upvotes: 0
Views: 3621
Reputation: 404
From the StandardScaler API:
Standardize features by removing the mean and scaling to unit variance
It is trained on x1, so it uses the variance/mean of x1 in both cases.
So what this does is simply:
>>> (x1 - np.mean(x1)) / np.std(x1)
array([-0.94627295, -0.90205459, -0.54830769, 0.95511663, 1.44151861])
>>> (x2 - np.mean(x1)) / np.std(x1)
array([ -0.94627295, -0.90205459, -0.54830769, 0.95511663, 28.50315638])
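The statistics learned during fit are stored on the scaler itself, so you can confirm that only x1's mean and standard deviation are used (outputs rounded here):
>>> scale_inst.mean_   # same as np.mean(x1)
22.4
>>> scale_inst.scale_  # same as np.std(x1)
22.615...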
You are probably looking for what Sagar proposed.
Upvotes: 3
Reputation: 777
It is behaving correctly. For your use case, you can use MinMaxScaler or MaxAbsScaler, which (roughly speaking) keep both training and test data in [0, 1] or [-1, 1] respectively; see the sketch below.
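As a minimal sketch of that idea (assuming each vector is re-fitted with fit_transform, which is what keeps the larger test value inside the range; the values in the comments are rounded and only illustrative):
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
import numpy as np
x1 = np.array([1, 2, 10, 44, 55]).reshape(-1, 1)   # newer sklearn versions expect 2-D input
x2 = np.array([1, 2, 10, 44, 667]).reshape(-1, 1)
# MinMaxScaler maps the data it is fitted on to [0, 1]
minmax = MinMaxScaler()
print(minmax.fit_transform(x1).ravel())  # ≈ [0, 0.019, 0.167, 0.796, 1.0]
print(minmax.fit_transform(x2).ravel())  # refitted on x2, so 667 -> 1.0
# MaxAbsScaler divides by the largest absolute value, mapping onto [-1, 1]
maxabs = MaxAbsScaler()
print(maxabs.fit_transform(x1).ravel())  # 55 -> 1.0
print(maxabs.fit_transform(x2).ravel())  # 667 -> 1.0
Note that these scalers also reuse whatever statistics they were fitted on: if you fit only on x1 and then transform x2, the value 667 again lands outside the range, so the test data only stays in [0, 1] / [-1, 1] when the scaler has seen a comparable maximum.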
Upvotes: 2