Reputation: 156
I have a training dataset of 1000 samples. It contains about 50 features, of which 30 are categorical and the rest are numerical/continuous. Which algorithm is best suited to handle a mixed feature set of both categorical and continuous features?
Upvotes: 3
Views: 5387
Reputation: 1
Standardizing categorical or discrete variables (i.e. those that are either 0 or 1) is not a good idea, because the standardized inputs will follow a distribution outside the original data (e.g. values such as 0.5 or 0.7), which were never part of the input signal.
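To make the point concrete, here is a minimal sketch using only the Python standard library. The `is_male` column is made-up toy data, not taken from the question. It only shows that a standardized 0/1 indicator no longer takes the values 0 and 1; it still has exactly two distinct values.

```python
from statistics import mean, pstdev

# Hypothetical 0/1 indicator column (toy data for illustration only).
is_male = [1.0, 0.0, 0.0, 0.0]

mu = mean(is_male)        # fraction of ones in the column
sigma = pstdev(is_male)   # population standard deviation

# Z-score each entry: subtract the mean, divide by the standard deviation.
standardized = [(v - mu) / sigma for v in is_male]

# The column still has two distinct values, but they are no longer 0 and 1.
distinct = sorted({round(v, 4) for v in standardized})
print(distinct)  # [-0.5774, 1.7321]
```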
Upvotes: 0
Reputation: 40909
In general, a preferred approach is to convert all your features into standardized continuous features.
1. For features that were originally continuous, perform standardization: x_i = (x_i - mean(x)) / standard_deviation(x). That is, for each feature, subtract the feature's mean and then divide by its standard deviation. An alternative approach is to rescale the continuous features into the range [0, 1]: x_i = (x_i - min(x)) / (max(x) - min(x)).
2. For categorical features, perform binarization so that each category value becomes its own continuous variable taking on the value 0.0 or 1.0. For example, if you have a categorical variable "gender" that can take on the values MALE, FEMALE, and NA, create three binary variables IS_MALE, IS_FEMALE, and IS_NA, where each variable can be 0.0 or 1.0. You can then perform standardization as in step 1.
Now you have all your features as standardized continuous variables.
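The procedure in this answer can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper names (`standardize`, `min_max_scale`, `one_hot`) and the toy `ages` and `genders` columns are made up for the example, and the population standard deviation is used because the answer does not say which variant it means.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Alternative for continuous features: rescale into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values, categories):
    """Binarize a categorical column into one 0.0/1.0 column per category."""
    return {
        f"IS_{c}": [1.0 if v == c else 0.0 for v in values]
        for c in categories
    }

# Hypothetical toy columns.
ages = [20.0, 30.0, 40.0, 50.0]
genders = ["MALE", "FEMALE", "NA", "FEMALE"]

ages_z = standardize(ages)        # continuous feature, standardized
ages_01 = min_max_scale(ages)     # continuous feature, rescaled to [0, 1]

gender_cols = one_hot(genders, ["MALE", "FEMALE", "NA"])
# Standardize the binary columns the same way as the continuous ones.
gender_cols_z = {name: standardize(col) for name, col in gender_cols.items()}
```

After this, `ages_z` and every column in `gender_cols_z` have mean 0 and standard deviation 1, so all features are on a comparable scale.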
Upvotes: 2