DuplicitousManowar

Reputation: 65

Feature Selection in Scikit-learn Encounters Problems with Mixed Variable Types

I'm currently trying to do feature selection for a dataset I have. There are about 50 variables, 35 of which are categorical; each categorical variable is either binary or has fewer than 5 possible values. I'm trying to get down to ~15 input variables, counted before preprocessing.

I'm trying to use Recursive Feature Elimination with Cross-Validation (RFECV) in scikit-learn. Because the dataset mixes continuous and categorical variables, I run into problems once I one-hot encode the categoricals, and I have two questions:

  1. Will RFE still work with the one-hot encoded columns, and will it be accurate?
  2. How can I tell which pre-encoding column each selected feature corresponds to? For example, if it tells me to keep column 20, how do I know which original column that maps to, so I can keep that as an input variable?

I'm not going to include the preprocessing code, but all it does is impute and one-hot encode, with no columns dropped.

Here are the two RFECV objects I have:

from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Linear SVM as the estimator (provides coef_ for ranking features)
clf = SVC(kernel="linear")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv.fit(x_train, y_train)

# Tree ensemble as the estimator (provides feature_importances_)
clf2 = ExtraTreesClassifier(random_state=RANDOM_SEED)
rfecv2 = RFECV(estimator=clf2, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv2.fit(x_train, y_train)

Upvotes: 0

Views: 627

Answers (1)

Roim

Reputation: 3066

  1. One-hot encoding turns your categorical features into discrete binary columns, and RFE will work just fine with them. The question to ask is whether RFE works with categorical data at all (answer: it depends on the estimator), but it handles binary features well, and a one-hot encoding is ultimately just a group of binary features. Accuracy should be fine even with one-hot encoding.
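To see why this is unproblematic, here is a minimal illustration (column names like "Food" are made up for the example): pandas dummies are exactly a group of 0/1 columns, which RFE can rank like any other numeric feature.

```python
import pandas as pd

# A single categorical column...
df = pd.DataFrame({"Food": ["Pizza", "Pasta", "Pizza"]})

# ...becomes one binary 0/1 column per category value
encoded = pd.get_dummies(df, columns=["Food"]).astype(int)
print(list(encoded.columns))  # ['Food_Pasta', 'Food_Pizza']
print(encoded["Food_Pizza"].tolist())  # [1, 0, 1]
```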

  2. Unfortunately, there is no "automatic" way to do so; you'll have to handle it manually in some way. The best semi-automated approach I can think of is to save a mapping and then use it. For example, save a dict: my_dict = {"Food_Pizza": "Food", "Food_Pasta": "Food"}, then call orig_column = my_dict[new_column] to recover the original column. The other option depends on how your features are named after one-hot encoding. For example, if all your encoded columns are named "FeatureName_value" (as with pandas dummies), you can just parse the name and take everything before the "_" character.
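The name-parsing idea above can be sketched end to end. This is a toy example, not your data: the columns ("Age", "Food") and the support mask are made up, standing in for the boolean mask RFECV exposes as rfecv.support_ after fitting.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
df = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "Food": ["Pizza", "Pasta", "Pizza", "Salad"],
})

# Columns after encoding: Age, Food_Pasta, Food_Pizza, Food_Salad
encoded = pd.get_dummies(df, columns=["Food"])

# Pretend RFECV kept columns 0 and 2 (in practice: support = rfecv.support_)
support = [True, False, True, False]
selected = encoded.columns[support]

# Map each selected name back to its original column: untouched columns
# map to themselves, dummy columns are parsed on the "_" separator
orig_of = {c: c if c in df.columns else c.split("_")[0] for c in selected}
original_features = sorted(set(orig_of.values()))
print(original_features)  # ['Age', 'Food']
```

One caveat: the split-on-"_" trick breaks if an original feature name itself contains "_", which is why the explicit dict mapping is the safer option.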

Upvotes: 1
