1a2a3a 4a5a6a

Reputation: 103

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.

Say I have a model trained on min-max scaled data. To test the model, I scaled the test dataset with the same scaler that was fitted during training. However, my new test data turned out to contain a value below the training minimum, so the scaler returned a negative value.

As far as I know, the minimum and maximum aren't very stable values, especially in a volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?

Upvotes: 2

Views: 1732

Answers (2)

Jeff

Reputation: 346

I happen to disagree with @Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques that penalize large coefficients (whether they be linear regression coefficients or neural network weights). The combination of feature scaling and regularization helps to ensure your model generalizes to unobserved data.

Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
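To make the situation from the question concrete, here is a minimal sketch in plain Python. The helper names (`fit_min_max`, `min_max_transform`) and the price values are made up for illustration; the formula is the usual (x - min) / (max - min), which is also what MinMaxScaler applies with its default range.

```python
def fit_min_max(values):
    """Learn the bounds from the training data only."""
    return min(values), max(values)


def min_max_transform(values, lo, hi):
    """Apply (x - lo) / (hi - lo) with bounds that were learned elsewhere."""
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical prices: the later data falls outside the training range.
train_prices = [100.0, 120.0, 150.0, 200.0]
test_prices = [90.0, 160.0, 220.0]

lo, hi = fit_min_max(train_prices)            # bounds come from training data
scaled_train = min_max_transform(train_prices, lo, hi)
scaled_test = min_max_transform(test_prices, lo, hi)

# Training values stay inside [0, 1]; test values are allowed to leave it.
print(scaled_train)
print(scaled_test)
```

The test value below the training minimum comes out negative and the one above the training maximum comes out greater than 1. Nothing has gone wrong here: the values are still expressed on the same scale the model was trained on.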

In general, I would recommend considering different scaling routines. scikit-learn's MinMaxScaler is one, as is StandardScaler (subtract the mean and divide by the standard deviation). In the case where your target variable, cryptocurrency price, can vary over multiple orders of magnitude, it might be worth using the logarithm function for scaling some of your variables. This is where data science becomes an art -- there's not necessarily a 'right' answer here.
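As a rough sketch of the log-plus-standardization idea, the snippet below uses only the standard library. The helper names and the prices are hypothetical, and the population standard deviation is used to mirror what StandardScaler computes; treat it as one possible arrangement, not a recommendation for any particular dataset.

```python
import math
import statistics


def fit_standardizer(values):
    """Learn mean and population standard deviation from training data."""
    return statistics.mean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Subtract the training mean and divide by the training std."""
    return [(x - mean) / std for x in values]


# Hypothetical prices spanning several orders of magnitude.
train_prices = [100.0, 1_000.0, 10_000.0]
test_prices = [50.0, 100_000.0]

# Take logs first so that multiplicative changes become additive ones.
train_log = [math.log10(p) for p in train_prices]
test_log = [math.log10(p) for p in test_prices]

# Statistics are fitted on the training data and reused for the test data.
mean, std = fit_standardizer(train_log)
scaled_train = standardize(train_log, mean, std)
scaled_test = standardize(test_log, mean, std)

print(scaled_train)
print(scaled_test)
```

With the log applied first, a test price ten times larger than the training maximum ends up only a few standard deviations from the training mean, instead of far outside the range as it would under min-max scaling of the raw prices.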

(EDIT) - Also see: Do you apply min max scaling separately on training and test data?

Upvotes: 3

Sharan

Reputation: 731

Ideally you should scale first and only then split into test and train. But it's not preferable to use a min-max scaler with data whose min and max values can vary dynamically, with significant variance, in a real-time scenario.

Upvotes: 0
