user3207663

Reputation: 156

Machine learning algorithm for mixed categorical and numeric features

I have a training dataset of 1000 samples. It contains about 50 features, of which 30 are categorical and the rest are numerical/continuous. Which algorithm is best suited to handle a mixed feature set of both categorical and continuous features?

Upvotes: 3

Views: 5387

Answers (2)

old_ai_coder

Reputation: 1

Standardizing categorical or discrete variables (i.e. those that are either 0 or 1) is not a good idea, because the standardized inputs will follow a distribution that was never part of the original input signal (e.g. values such as 0.5 or 0.7 instead of 0 and 1).
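To make this concrete, here is a small sketch in plain Python of what z-score standardization does to a 0/1 indicator column. The column name and its values are made up for illustration, and the population standard deviation is an assumption. The two original levels end up as two other real numbers that depend on how often each level occurs in the training data.

```python
from statistics import mean, pstdev

# Hypothetical 0/1 indicator column (e.g. IS_MALE) for eight training rows.
is_male = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

mu = mean(is_male)       # fraction of ones in the column
sigma = pstdev(is_male)  # population standard deviation (an assumption)

standardized = [(v - mu) / sigma for v in is_male]

# The column still has only two distinct values, but they are no longer 0 and 1.
print(sorted(set(round(v, 3) for v in standardized)))
```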

Upvotes: 0

stackoverflowuser2010

Reputation: 40909

In general, a preferred approach is to convert all your features into standardized continuous features.

  1. For features that were originally continuous, perform standardization: x_i = (x_i - mean(x)) / standard_deviation(x). That is, for each feature, subtract the mean of the feature and then divide by the standard deviation of the feature. An alternative approach is to convert the continuous features into the range [0, 1]: x_i = (x_i - min(x)) / (max(x) - min(x)).

  2. For categorical features, binarize them so that each category value becomes its own continuous variable taking the value 0.0 or 1.0. For example, if you have a categorical variable "gender" that can take the values MALE, FEMALE, and NA, create three binary variables IS_MALE, IS_FEMALE, and IS_NA, each of which can be 0.0 or 1.0. You can then perform standardization as in step 1.

Now you have all your features as standardized continuous variables.
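The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the toy rows, the `age` column, and the helper names `standardize`, `min_max_scale`, and `one_hot` are made up for the example, and the population standard deviation is used because the formula above does not say which variant is meant.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Alternative from step 1: rescale into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Step 2: one 0.0/1.0 column per category, named IS_<category>."""
    categories = sorted(set(values))
    return {f"IS_{c}": [1.0 if v == c else 0.0 for v in values]
            for c in categories}

# Hypothetical toy data: one continuous and one categorical feature.
ages = [22.0, 35.0, 58.0, 41.0]
genders = ["MALE", "FEMALE", "NA", "FEMALE"]

# Step 1: standardize the continuous feature.
features = {"age": standardize(ages)}

# Step 2: binarize the categorical feature, then standardize each new column.
for name, column in one_hot(genders).items():
    features[name] = standardize(column)

for name, column in features.items():
    print(name, [round(v, 3) for v in column])
```

On real data, the mean and standard deviation would normally be computed on the training set only and then reused unchanged when transforming new samples.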

Upvotes: 2
