Reputation: 191
I am trying to generate a model that uses several physico-chemical properties of a molecule (incl. number of atoms, number of rings, volume, etc.) to predict a numeric value Y. I would like to use PLS Regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn. The type and range of the features vary: some are int64 while others are float, and some features generally have small (positive or negative) values while others have very large values. I have tried various scalers (e.g. StandardScaler, normalize, MinMaxScaler), yet the R2/Q2 are still low. I have a few questions:
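For reference, this is roughly the setup being described; a minimal sketch that puts the scaler and PLSRegression in one pipeline and uses cross-validated R2 as a stand-in for Q2 (the synthetic X, y, the n_components value, and the 5-fold CV are placeholders, not the actual data):

```python
# Minimal sketch: scaler + PLS in one pipeline, scored with cross-validated R2.
# X, y and n_components are placeholders, not the real descriptor data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)) * [1, 10, 100, 1, 1, 1000, 1, 1]  # mixed feature scales
y = X[:, 0] + 0.1 * X[:, 5] + rng.normal(scale=0.5, size=100)

# Putting the scaler inside the pipeline means each CV fold is scaled using
# only its own training data, so the validation fold never leaks into the fit.
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=2))

# Cross-validated R2 is what is usually reported as Q2.
q2_scores = cross_val_score(pls, X, y, cv=5, scoring="r2")
print("Q2 per fold:", np.round(q2_scores, 3), "mean:", round(q2_scores.mean(), 3))
```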
Upvotes: 1
Views: 691
Reputation: 1902
The whole idea of scaling is to make models more robust to the feature space. For example, if you have two features recorded as 5 kg and 5000 g, we know they are the same quantity, but algorithms that are sensitive to the metric space, such as KNN and PCA, will weight the second feature much more heavily, so scaling must be done for those algorithms.
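A quick illustration of that kg-vs-gram point (the numbers are made up): the Euclidean distances that KNN and PCA rely on are dominated by whichever column happens to have the larger numeric range, until the columns are standardized.

```python
# Same physical quantity stored in two units; distances are dominated by the
# gram column until both columns are standardized.
import numpy as np
from sklearn.preprocessing import StandardScaler

weight_kg = np.array([[5.0], [5.1], [9.0]])
weight_g = weight_kg * 1000                     # identical information, bigger numbers

X = np.hstack([weight_kg, weight_g])
print(np.linalg.norm(X[0] - X[1]))              # ~100, almost entirely from the gram column
print(np.linalg.norm(X[0] - X[2]))              # ~4000

X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))      # now both columns contribute equally
```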
Now coming to your question about regularization: it has very useful properties. If you think you have many useless features, you can use L1 regularization, which creates a sparsity effect on the feature space, i.e. it assigns a weight of 0 to the useless features (see the sketch below). Here is the link for more info. One more point: some methods, such as tree-based models, don't need scaling. In the end, it mostly depends on the model you choose.
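A minimal sketch of that sparsity effect on synthetic data (the alpha value and the data are just placeholders): with an L1 penalty, Lasso pushes the coefficients of the uninformative columns to exactly zero.

```python
# L1 (Lasso) sketch on synthetic data: only 2 of the 10 features carry signal,
# and the L1 penalty drives the remaining coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)        # L1 penalties also assume comparable scales
lasso = Lasso(alpha=0.1).fit(X_std, y)
print(np.round(lasso.coef_, 3))                  # most of the 8 useless coefficients are 0.0
```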
Upvotes: 2
Reputation: 10375
Upvotes: 1