Scaling test data that is not truly representative of the training data

Question

I've built a model that I would like to test on unseen data. I feed the data in daily, and its range can differ from day to day. For example, if I use MinMaxScaler(), I scale the training data to the [0, 1] interval.

Now, the maximum value in the training set is 100, which will be transformed to 1.

When my test data comes in daily, its maximum value could turn out to be only 10, which would also be transformed to 1.

from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df_scaled = min_max_scaler.fit_transform(df.values)

I tried using mean normalisation instead, e.g. df_norm = (df - df.mean()) / (df.max() - df.min()), and then applying the same training statistics to the test data:

test_norm = (test_df - df.mean()) / (df.max() - df.min())

But my data is not normally distributed. It is probably exponentially distributed, with a high number of 0s and relatively few large values.

Vivek Kumar · Accepted Answer

No, the maximum value of your test data (i.e. 10) will not be scaled to 1. It will be scaled to 0.1 (assuming the training minimum is 0), provided the scaler is applied using the max and min learned from the training data.

That can be achieved by calling only min_max_scaler.transform() on the test data. fit() and fit_transform() should be used on the training data only.

So for the training data, the code stays the same:

df_train_scaled = min_max_scaler.fit_transform(df_train.values)

But for the test data, it becomes:

df_test_scaled = min_max_scaler.transform(df_test.values)

This way, MinMaxScaler stores the max and min values seen during fit() on the training data and reuses them to scale the test data consistently.
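As a minimal sketch of the above, using made-up numbers that match the question (a training maximum of 100 and a daily test maximum of 10; the arrays `train` and `test` are hypothetical, and the training minimum is assumed to be 0):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: training values span 0..100, one day of test values spans 0..10.
train = np.array([[0.0], [20.0], [100.0]])
test = np.array([[0.0], [5.0], [10.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from the training data
test_scaled = scaler.transform(test)        # reuses the learned min and max, no refitting

print(scaler.data_min_, scaler.data_max_)   # statistics stored during fit
print(train_scaled.ravel())                 # training maximum maps to 1
print(test_scaled.ravel())                  # test maximum of 10 maps to about 0.1
```

Note that a test value larger than the training maximum would be mapped above 1 in the same way, since the scaler only knows the range it saw during fit().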
