Arjith Babu

Reputation: 31

Scaling of categorical variable

  1. Do categorical variables need to be scaled before model building? I have scaled all my continuous numerical variables using StandardScaler. Now all the continuous variables are between -1 and 1, whereas the categorical columns are binary.

  2. How will it affect my model?

  3. Can someone please explain how a scaled categorical variable will affect the splitting of nodes in the DecisionTreeClassifier?

Upvotes: 3

Views: 17390

Answers (1)

Arsik36

Reputation: 327

When you one-hot encode your categorical variables, the encoded columns contain only the values 0 and 1, so they will not negatively affect your model. Encoding these variables and passing them to ML algorithms is a good idea, as you may gain additional insights from the models.

When scaling your dataset, make sure you pay attention to 2 things:

  1. Some ML algorithms require data to be scaled, and some do not. It is good practice to scale your data only for models that are sensitive to unscaled data, such as kNN.

  2. There are different methods to scale your data. StandardScaler() is one of them, but it is sensitive to outliers, because the mean and standard deviation it relies on are themselves pulled by extreme values. Therefore, make sure you are using the scaling method that best fits your data and business needs. You can learn more about different scaling methods here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
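To make the first point concrete for the DecisionTreeClassifier from the question: a tree splits on a threshold within one feature at a time, so rescaling a column (binary or continuous) should move the threshold values but not change which samples end up on each side. Here is a minimal sketch on made-up data; the column names `age` and `is_member` and the labelling rule are assumptions for illustration only, not anything from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): one continuous column and one binary column.
rng = np.random.default_rng(0)
age = rng.normal(40, 12, size=300)
is_member = rng.integers(0, 2, size=300)
X = np.column_stack([age, is_member])
y = ((age > 45) | (is_member == 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale every column, including the binary one, using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same tree settings and seed; only the feature scale differs.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train_scaled, y_train)

# Compare held-out predictions and the feature index used at each node.
same_predictions = np.array_equal(
    tree_raw.predict(X_test), tree_scaled.predict(X_test_scaled)
)
same_split_features = np.array_equal(
    tree_raw.tree_.feature, tree_scaled.tree_.feature
)
print(same_predictions, same_split_features)
```

I would expect both comparisons to agree here, since standardization is a monotonic transformation of each column. The same check with a distance-based model such as kNN would generally not agree, which is why scaling matters there.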

Encoded categorical variables contain only the values 0 and 1, so there is no need to scale them. However, scaling will be applied to them as well if you choose to scale your entire dataset before using it with scale-sensitive ML models.
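If you would rather leave the encoded columns untouched, one option is to scale only the continuous columns. Below is a rough sketch using scikit-learn's ColumnTransformer; the DataFrame and its column names (`income`, `age`, `city_london`, `city_paris`) are invented for illustration, so swap in your own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two continuous columns and two one-hot encoded columns.
df = pd.DataFrame({
    "income": [32000.0, 54000.0, 41000.0, 78000.0],
    "age": [23.0, 45.0, 31.0, 52.0],
    "city_london": [1, 0, 0, 1],
    "city_paris": [0, 1, 1, 0],
})
continuous_cols = ["income", "age"]
binary_cols = ["city_london", "city_paris"]

# Standardize the continuous columns; pass the binary columns through as-is.
preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), continuous_cols),
        ("keep", "passthrough", binary_cols),
    ]
)

# Output columns follow the transformer order: scaled first, then binary.
X = preprocess.fit_transform(df)
print(X)
```

The transformer can also be placed as the first step of a Pipeline in front of a scale-sensitive model, so the same scaling is reapplied at prediction time.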

Upvotes: 2
