Reputation: 65
I'm currently trying to do feature selection for a dataset I have. There are about 50 variables, 35 of which are categorical; each categorical is either binary or has fewer than 5 possible values. I'm trying to get down to ~15 input variables before the preprocessing.
I'm trying to use Recursive Feature Elimination with Cross-Validation (RFECV) in scikit-learn. Because there is a mix of continuous and categorical variables, I run into some problems when I one-hot encode the categoricals, and I have two questions about that:
I'm not including the preprocessing code, but all it does is impute missing values and one-hot encode, with no columns dropped.
Here are the two RFECV objects I have:
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Linear SVC exposes coef_, which RFECV uses to rank features
clf = SVC(kernel="linear")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv.fit(x_train, y_train)

# Tree ensemble exposes feature_importances_ instead
clf2 = ExtraTreesClassifier(random_state=RANDOM_SEED)
rfecv2 = RFECV(estimator=clf2, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv2.fit(x_train, y_train)
Upvotes: 0
Views: 627
Reputation: 3066
One-hot encoding turns your categorical features into discrete binary features, and RFE will work just fine with them. You should ask yourself whether RFE works with categorical data at all (answer: it depends on the estimator), but it is perfectly happy with binary features, and one-hot encoding ultimately produces nothing more than a group of binary features. Accuracy should be fine even with one-hot encoding.
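To see concretely that one-hot encoding is just a group of binary features, here is a minimal sketch using `pandas.get_dummies` (the column and category names are made up for illustration):

```python
import pandas as pd

# A single categorical column with < 5 possible values...
df = pd.DataFrame({"Food": ["Pizza", "Pasta", "Pizza"]})

# ...becomes one binary indicator column per category.
encoded = pd.get_dummies(df, columns=["Food"])
print(encoded.columns.tolist())  # ['Food_Pasta', 'Food_Pizza']
```

Each resulting column contains only 0/1 values, which is exactly the kind of input RFE's estimators handle without issue.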
Unfortunately, there is no "automatic" way to do so; you'll have to do it manually in some way. The best semi-automated approach I can think of is to save a mapping and then use it. For example, save a dict: my_dict = {"Food_Pizza": "Food", "Food_Pasta": "Food"}. Then you just call orig_column = my_dict[new_column] to get back the regular column. The other option depends on how your features are named by the one-hot encoding. For example, if all your one-hot encoded columns are named "FeatureName_value" (as with pandas dummies), you can just parse the name and take everything before the "_" character.
Upvotes: 1