Makaroniiii

Reputation: 348

sklearn StandardScaler() can affect test matrix result

I do not come from a statistics background, but while working with machine learning and NNs I have seen that scaling data can cause a lot of harm. From what I have learned, scaling the data before the train-test split is not a good idea, but please take a look at this example where scaling is done after the train-test separation.

import numpy as np
from sklearn.preprocessing import StandardScaler


train_matrix = np.array([[1,2,3,4,5]]).T

test_matrix = np.array([[1]]).T


e = StandardScaler()
train_matrix = e.fit_transform(train_matrix)
test_matrix = e.fit_transform(test_matrix)

print(train_matrix)

print(test_matrix)

[out]:

[[-1.41421356]   #train data
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]


[[ 0.]]   #test data

The StandardScaler class performs two independent scaling operations, one per dataset, and the resulting inconsistency can harm your NN results: in the train matrix, the value 1 maps to -1.41421356, while in the test matrix the same value 1 maps to 0. Now imagine running a prediction with the test data against weights learned from the training data: for the same input value 1, you would get a completely different result. How can this be overcome?

Upvotes: 2

Views: 1165

Answers (1)

Miriam Farber

Reputation: 19664

You shouldn't transform train and test separately. Instead, you should fit the scaler on the training data (and then transform it using the scaler), and then transform the test data with the fitted scaler. So in your code you should do:

e = StandardScaler()
train_matrix = e.fit_transform(train_matrix)
test_matrix = e.transform(test_matrix)

Then when you print the transformed train and test data you get the expected result:

[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]


[[-1.41421356]]
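As a further note (a sketch, not part of the answer above): scikit-learn's `Pipeline` handles this bookkeeping automatically. When you call `fit` on the pipeline, the scaler is fitted on the training data only, and `predict` reuses the stored mean and standard deviation for the test data, so there is no way to accidentally re-fit on the test set. The toy data and `LinearRegression` estimator here are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Toy data: targets are an exact linear function of the single feature.
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])
X_test = np.array([[1]])

# The pipeline fits the scaler on the training data only;
# predict() applies the already-fitted scaler to the test data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
print(model.predict(X_test))  # prediction on the training scale
```

Because scaling and fitting are bundled into one estimator, the same object can also be passed to `cross_val_score` without leaking test statistics into the scaler.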

Upvotes: 8
