Reputation: 65
I'm currently trying to do feature selection for a dataset I have. There are about 50 variables, 35 of which are categorical; each categorical is either binary or has fewer than 5 possible values. I'm trying to get down to ~15 input variables before the preprocessing.
I'm trying to use Recursive Feature Elimination with Cross-Validation (RFECV) in scikit-learn. Because there is a mix of continuous and categorical variables, I run into some problems when I one-hot encode the categoricals, and I have two questions about that:
I'm not including the preprocessing code, but all it does is impute missing values and one-hot encode, with no columns dropped.
Here are the two RFECV objects I have:
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Linear SVC exposes coef_, which RFECV uses to rank features
clf = SVC(kernel="linear")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv.fit(x_train, y_train)

# Tree ensemble exposes feature_importances_ instead
clf2 = ExtraTreesClassifier(random_state=RANDOM_SEED)
rfecv2 = RFECV(estimator=clf2, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv2.fit(x_train, y_train)
Upvotes: 0
Views: 627
Reputation: 3066
One-hot encoding turns your categorical features into discrete binary features, and RFE will work just fine with them. You should ask yourself whether RFE works with categorical data at all (answer: it depends on the estimator), but it is perfectly happy with binary features, and one-hot encoding ultimately produces nothing more than a group of binary features. Accuracy should be fine even with one-hot encoding.
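To see concretely that one-hot encoding is just a group of binary features, here is a minimal sketch using `pandas.get_dummies` (the column and category names are made up for illustration):

```python
import pandas as pd

# A single categorical column with < 5 possible values...
df = pd.DataFrame({"Food": ["Pizza", "Pasta", "Pizza"]})

# ...becomes one binary indicator column per category.
encoded = pd.get_dummies(df, columns=["Food"])
print(encoded.columns.tolist())  # ['Food_Pasta', 'Food_Pizza']
```

Each resulting column contains only 0/1 values, which is exactly the kind of input RFE's estimators handle without issue.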
Unfortunately, there is no "automatic" way to do so; you'll have to do it manually in some way. The best semi-automated approach I can think of is to save a mapping and then use it. For example, save a dict: my_dict = {"Food_Pizza": "Food", "Food_Pasta": "Food"}. Then you just call orig_column = my_dict[new_column] to get back the regular column. The other option depends on how your features are named by the one-hot encoding. For example, if all your one-hot encoded columns are named "FeatureName_value" (as with pandas dummies), you can just parse the name and take everything before the "_" character.
Upvotes: 1