Reputation: 385
I want to do tree-based feature selection. My dataset has about 30 columns, and after selection about 5 remain, which is great for me. The problem is that the resulting 5-column dataset does not keep the column names, so I cannot identify which features were selected.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
data = pd.read_csv(file)
X = data.drop('target', axis=1)  # pass axis by keyword; the positional form no longer works in pandas 2.0+
y = data['target']
X.shape #(100000, 30)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape #(100000, 5)
Can someone help me please?
Upvotes: 0
Views: 840
Reputation: 585
Now that I'm more sure of the answer, please try the following:
mask = model.get_support(indices=False) # this will return boolean mask for the columns
X_new = X.loc[:, mask] # the sliced dataframe, keeping selected columns
featured_col_names = X_new.columns # columns name index
If all you need is just the column names:
X.columns[model.get_support()]
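For reference, here is a minimal end-to-end sketch of the same idea on synthetic data. The dataset, the column names `f0`..`f9`, and the `n_estimators` / `random_state` values are made up for illustration, and the selector is fitted directly here instead of wrapping a prefit classifier as in the question:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real data: 200 rows, 10 named feature columns (hypothetical names).
features, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(10)])

# Fit the selector; it trains the tree ensemble internally and keeps
# features whose importance is at or above the default threshold (the mean).
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
selector.fit(X, y)

mask = selector.get_support()   # boolean array, one entry per column of X
selected = X.columns[mask]      # names of the selected columns
X_new = X.loc[:, mask]          # still a DataFrame, so the names are kept

print(list(selected))
```

Slicing the original DataFrame with the mask, rather than calling `transform`, is what keeps the column labels, since `transform` returns a plain NumPy array by default.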
Upvotes: 1