Carla Flores

Reputation: 33

Why do I have values outside the normalization range in my test set?

I get X_test values outside the range I specified for the scaler. Why does this happen, and how can I solve it? (The slice [:,14:] is applied to X_train and X_test because the numerical values in my dataset start at that column.)

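  from sklearn.preprocessing import MinMaxScaler
  scalar = MinMaxScaler(feature_range=(-1,1))
  # Learn min/max from the training columns and scale them
  X_train[:,14:]=scalar.fit_transform(X_train[:,14:])
  # Reuse the training min/max to scale the test columns
  X_test[:,14:]=scalar.transform(X_test[:,14:])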

Plotting X_train and X_test shows that the values in X_train are within the range, while X_test has some values outside it.

This is X_train plot

[image: X_train plot]

This is X_test plot

[image: X_test plot]

Why is this happening?

Upvotes: 0

Views: 660

Answers (2)

PV8

Reputation: 6270

You are doing everything right; this is the normal behavior.

Let's look at the example from the official docs to see what is going on. The only difference is that it uses feature_range=(0, 1) instead of (-1, 1).

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit_transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]

What happened here? The training data is transformed by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min and max are the bounds of feature_range.

So the training values end up in the range 0 to 1.

Now we apply the scaler to new test data without fitting it again, just as you do in your case:

>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]

As you can see, the output is also outside the range. That happens because, for the first value, the formula gives:

X_std = (2 - (-1)) / (1 - (-1)) = 3/2
X_scaled = 3/2 * (1 - 0) + 0 = 1.5
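
The same arithmetic can be checked with a small sketch in plain Python. The helper name min_max_scale is hypothetical; it only re-implements the formula quoted above for a single feature and is not scikit-learn's own code.

```python
def min_max_scale(x, train_min, train_max, feature_range=(0.0, 1.0)):
    """Scale x with the min/max learned from the training data."""
    lo, hi = feature_range
    x_std = (x - train_min) / (train_max - train_min)
    return x_std * (hi - lo) + lo

# First feature of the training data from the docs example
train_col = [-1, -0.5, 0, 1]
train_min, train_max = min(train_col), max(train_col)

inside = min_max_scale(0, train_min, train_max)   # value inside the training range
outside = min_max_scale(2, train_min, train_max)  # value above the training max
print(inside, outside)
```

Any test value above the training maximum (or below the training minimum) lands outside feature_range, because the min and max in the formula never change after fit.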

Upvotes: 1

Nikhil Kumar

Reputation: 1232

You are using fit on the training set, as should be done.

This means that in the formula (X - X_min) / (X_max - X_min), the X_min and X_max refer to the minimum and maximum values in your training set respectively, NOT the test set.

So if your test set has values outside the minimum and maximum values in your training set, those values in the test set will be mapped outside the feature_range that you provided, by simple arithmetic.
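
To find which test values are responsible, one option is to compare each test column against the training minimum and maximum. This is a minimal sketch with NumPy; the arrays are made-up illustration data, not your dataset.

```python
import numpy as np

# Illustration data only (assumed, not from the question)
X_train = np.array([[-1.0, 2.0], [-0.5, 6.0], [0.0, 10.0], [1.0, 18.0]])
X_test = np.array([[2.0, 2.0], [0.5, 20.0], [0.0, 5.0]])

# Per-feature range seen during fit
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)

# True where a test value lies outside the training range
out_of_range = (X_test < train_min) | (X_test > train_max)
print(out_of_range)
print(out_of_range.sum(axis=0))  # count of such values per feature
```

Every True entry will be mapped outside feature_range by transform.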

There shouldn't be anything to worry about in your case: the scaled test-set values are quite close to the feature_range you provided.

Just make sure the values in your test set aren't on a totally different scale from those in your training set. You might consider removing the outliers in your test set to solve the issue.
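
If you would rather keep every test row, another option is to clip the scaled values to the range instead. MinMaxScaler accepts a clip parameter for this, which I believe was added in scikit-learn 0.24, so the sketch below assumes that version or newer. The data is the docs example from the other answer, not your dataset.

```python
from sklearn.preprocessing import MinMaxScaler

X_train = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
X_test = [[2, 2]]  # first value is above the training max

# clip=True limits transformed values to feature_range
# (assumes scikit-learn >= 0.24)
scaler = MinMaxScaler(feature_range=(-1, 1), clip=True)
scaler.fit(X_train)

clipped = scaler.transform(X_test)
print(clipped)
```

Keep in mind that clipping hides how far outside the training range a value was, so whether it is appropriate depends on your model and data.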

Upvotes: 2
