Reputation: 33
I got X_test values outside the range I specified in the normalization function. Why am I getting those, and how can I solve it? (The slice [:,14:] in X_train and X_test is used because, in my dataset, the numerical values start in that column.)
from sklearn.preprocessing import MinMaxScaler
scalar = MinMaxScaler(feature_range=(-1,1))
X_train[:,14:]=scalar.fit_transform(X_train[:,14:])
X_test[:,14:]=scalar.transform(X_test[:,14:])
By plotting the X_train and X_test, we can appreciate that the values in X_train are within the range, while in the X_test there are some values outside that range.
This is X_train plot
This is X_test plot
Why is this happening?
Upvotes: 0
Views: 660
Reputation: 6270
You are doing everything right, and this is the normal behavior.
Let's look at the example from the official docs to see what is going on. The only difference is that it uses feature_range=(0, 1) instead of (-1, 1).
>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit_transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
What happened here? The training data is transformed column by column using:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min and max are the bounds of feature_range (here 0 and 1), and X.min and X.max are taken from the data the scaler was fitted on.
So the scaled training values lie in the range 0 to 1.
Now we transform a new test sample without fitting the scaler again, just as you do in your case:
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
So as you can see, the output is also outside the range. That happens because, for the first value, the formula gives:
X_std = (2 - (-1)) / (1 - (-1)) = 3/2
X_scaled = 3/2 * (1 - 0) + 0 = 1.5
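The same arithmetic can be checked without scikit-learn. This is only a minimal sketch in plain Python: the helpers fit_min_max and transform_min_max are hypothetical names that mirror the formula above, not the library's implementation, and they assume no column is constant (otherwise the division would fail).

```python
def fit_min_max(rows):
    """Return the per-column (min, max) pairs learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def transform_min_max(rows, stats, feature_range=(0.0, 1.0)):
    """Scale rows using the training statistics; nothing is clipped."""
    lo, hi = feature_range
    scaled = []
    for row in rows:
        scaled.append([
            (x - x_min) / (x_max - x_min) * (hi - lo) + lo
            for x, (x_min, x_max) in zip(row, stats)
        ])
    return scaled


train = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
stats = fit_min_max(train)                         # min/max come from train only
train_scaled = transform_min_max(train, stats)     # all values fall in [0, 1]
test_scaled = transform_min_max([[2, 2]], stats)   # [[1.5, 0.0]]
print(test_scaled)
```

The test value 2 in the first column is larger than the training maximum of 1, so it lands above the upper bound; the value 2 in the second column equals the training minimum, so it maps exactly to the lower bound.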
Upvotes: 1
Reputation: 1232
You are using fit on the training set, as should be done.
This means that in the formula (X - X_min) / (X_max - X_min), the X_min and X_max refer to the minimum and maximum values in your training set respectively, NOT the test set.
So if your test set has values outside the minimum and maximum values in your training set, those values in the test set will be mapped outside the feature_range that you provided, by simple arithmetic.
This shouldn't be anything to worry about in your case: the scaled test set values are quite close to the feature_range you provided.
Just make sure the values in your test set aren't on a scale totally different from those in your training set. You might consider removing the outliers in your test set to solve the issue.
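If the out-of-range values are a problem for whatever consumes the scaled data, another option is to clip them to the feature range after scaling. As far as I know, recent scikit-learn versions (0.24 and later) offer this through the clip parameter of MinMaxScaler. The sketch below only illustrates the idea in plain Python on made-up numbers; scale_column is a hypothetical helper, not library code.

```python
def scale_column(values, x_min, x_max, feature_range=(-1.0, 1.0), clip=False):
    """Min-max scale one column with training min/max, optionally clipping."""
    lo, hi = feature_range
    scaled = [(x - x_min) / (x_max - x_min) * (hi - lo) + lo for x in values]
    if clip:
        # Force anything outside the feature range back onto its bounds.
        scaled = [min(max(v, lo), hi) for v in scaled]
    return scaled


# Made-up data: the test column goes beyond the training min and max.
train_col = [10.0, 20.0, 30.0]
test_col = [5.0, 20.0, 35.0]
x_min, x_max = min(train_col), max(train_col)

unclipped = scale_column(test_col, x_min, x_max)           # [-1.5, 0.0, 1.5]
clipped = scale_column(test_col, x_min, x_max, clip=True)  # [-1.0, 0.0, 1.0]
print(unclipped, clipped)
```

Keep in mind that clipping throws away the information about how far outside the training range a test value was, so whether it is appropriate depends on your model.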
Upvotes: 2