Anjith

Reputation: 2308

Should I normalize or standardize my dataset for kNN?

I am trying to use kNN for a classification task. My dataset contains categorical features that are one-hot encoded, numerical features like price, and also BoW (CountVectorizer) vectors for my text column.

I know kNN is affected by feature scaling, so I am confused about which of these to use here:

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import normalize

Upvotes: 4

Views: 5726

Answers (1)

Venkatachalam

Reputation: 16966

My suggestion would be to go for MinMaxScaler.

One of the major reasons is that features such as price can't have negative values and, as you mentioned, the data could be sparse.

From Documentation:

The motivation to use this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.
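For illustration, here is a minimal sketch of what MinMaxScaler does to a non-negative numeric column (the price values below are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical price column: non-negative, with a zero that should stay zero.
prices = np.array([[0.0], [10.0], [25.0], [100.0]])

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler.fit_transform(prices)

print(scaled.ravel())  # [0.   0.1  0.25 1.  ] -- zeros stay zero, everything lands in [0, 1]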

At the same time, if your numerical variables have a huge variance (for example, due to outliers), then go for RobustScaler or StandardScaler.
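To see the difference, a small sketch on a made-up feature with one extreme value; RobustScaler centers on the median and scales by the IQR, so the outlier barely moves the rest of the data:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical feature with one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

print(StandardScaler().fit_transform(x).ravel())
# approx. [-0.504 -0.501 -0.499 -0.496  2.   ] -- the outlier inflates the std and squashes the rest

print(RobustScaler().fit_transform(x).ravel())
# [ -1.   -0.5   0.    0.5 498.5] -- (x - median) / IQR, so the inliers keep a sensible spread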

You don't have to scale the one-hot encoded features.

For BoW, it is important to preserve the sparsity of the data. If you apply StandardScaler (with its default mean centering), you will lose the sparsity. So go for MinMaxScaler here as well. Another option would be TfidfVectorizer, which applies l2 normalization by default.
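Putting the pieces together, here is a sketch of how this could be wired up with a ColumnTransformer feeding KNeighborsClassifier; the column names ('price', 'cat_a', 'cat_b', 'text') are placeholders, so substitute your own:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Column names are hypothetical -- replace them with the ones in your DataFrame.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["price"]),            # numeric features scaled to [0, 1]
    ("onehot", "passthrough", ["cat_a", "cat_b"]), # already one-hot encoded, no scaling needed
    ("bow", TfidfVectorizer(), "text"),            # a single string: the vectorizer expects 1-D input
])

knn = Pipeline([
    ("preprocess", preprocess),
    ("clf", KNeighborsClassifier(n_neighbors=5)),
])

# knn.fit(X_train, y_train)
# knn.predict(X_test)

ColumnTransformer keeps the combined matrix sparse when the sparse tf-idf block dominates, and KNeighborsClassifier accepts sparse input (it falls back to brute-force neighbor search).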

Upvotes: 2
