Reputation: 2439
I am using StandardScaler from sklearn to scale my feature vector, but it doesn't seem to fit the training feature vector properly. Or maybe this is the expected behavior; if it is, could someone explain why (preferably with some mathematical explanation too)?
from sklearn.preprocessing import StandardScaler
import numpy as np
scale_inst = StandardScaler()
# train feature vector
x1 = np.array([1, 2, 10, 44, 55])
# test feature vector
x2 = np.array([1, 2, 10, 44, 667])
# first I fit
scale_inst.fit(x1)
# then I transform both the training vector and the test vector
print(scale_inst.transform(x1))
print(scale_inst.transform(x2))
# OUTPUT
[-0.94627295 -0.90205459 -0.54830769 0.95511663 1.44151861]
[ -0.94627295 -0.90205459 -0.54830769 0.95511663 28.50315638]
Why does it scale 667 to 28.50315638? Shouldn't it be capped at 1.44151861, i.e. the scaled maximum of the training feature vector?
Upvotes: 0
Views: 3621
Reputation: 404
From the StandardScaler API:
Standardize features by removing the mean and scaling to unit variance
It is trained on x1, so it uses the variance/mean of x1 in both cases.
So what this does is simply:
>>> (x1 - np.mean(x1)) / np.std(x1)
array([-0.94627295, -0.90205459, -0.54830769, 0.95511663, 1.44151861])
>>> (x2 - np.mean(x1)) / np.std(x1)
array([ -0.94627295, -0.90205459, -0.54830769, 0.95511663, 28.50315638])
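The statistics learned during fit are stored on the scaler itself, so you can confirm that only x1's mean and standard deviation are used (outputs rounded here):
>>> scale_inst.mean_   # same as np.mean(x1)
22.4
>>> scale_inst.scale_  # same as np.std(x1)
22.615...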
You are probably looking for what Sagar proposed.
Upvotes: 3
Reputation: 777
It is behaving correctly. For your use case, you can use MinMaxScaler or MaxAbsScaler, which (roughly speaking) keep both training and test data in [0, 1] or [-1, 1] respectively; see the sketch below.
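As a minimal sketch of that idea (assuming each vector is re-fitted with fit_transform, which is what keeps the larger test value inside the range; the values in the comments are rounded and only illustrative):
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
import numpy as np
x1 = np.array([1, 2, 10, 44, 55]).reshape(-1, 1)   # newer sklearn versions expect 2-D input
x2 = np.array([1, 2, 10, 44, 667]).reshape(-1, 1)
# MinMaxScaler maps the data it is fitted on to [0, 1]
minmax = MinMaxScaler()
print(minmax.fit_transform(x1).ravel())  # ≈ [0, 0.019, 0.167, 0.796, 1.0]
print(minmax.fit_transform(x2).ravel())  # refitted on x2, so 667 -> 1.0
# MaxAbsScaler divides by the largest absolute value, mapping onto [-1, 1]
maxabs = MaxAbsScaler()
print(maxabs.fit_transform(x1).ravel())  # 55 -> 1.0
print(maxabs.fit_transform(x2).ravel())  # 667 -> 1.0
Note that these scalers also reuse whatever statistics they were fitted on: if you fit only on x1 and then transform x2, the value 667 again lands outside the range, so the test data only stays in [0, 1] / [-1, 1] when the scaler has seen a comparable maximum.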
Upvotes: 2