I would like to know which features were selected by SelectKBest(). I first applied a ColumnTransformer():
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler  # chi2 in SelectKBest() requires non-negative values
num_features = [...]
cat_features = [...]
ct = ColumnTransformer([
("scaling", MinMaxScaler(), num_features),
("onehot", OneHotEncoder(sparse=False, handle_unknown='ignore'), cat_features)],
remainder='passthrough') #pass through
X_train_trans = ct.fit_transform(X_train)
Then I applied SelectKBest():
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
skb = SelectKBest(chi2, k=100)
X_train_trans_select = skb.fit_transform(X_train_trans, y_train)
I am now having trouble working out which features were selected. I am aware of skb.get_support() and ct.get_feature_names(), but ct.get_feature_names() raises:
AttributeError: Transformer scaling (type MinMaxScaler) does not provide get_feature_names.
Reputation: 6667
What could work for your case is to first collect the column names in a list: for each fitted transformer, call get_feature_names() if the transformer provides it, and otherwise fall back to the original column names passed to that transformer.
import itertools
cols = [(transformer[1].get_feature_names() if getattr(transformer[1], "get_feature_names", None) else transformer[2])
for transformer in ct.transformers_]
cols = list(itertools.chain(*cols))
Then filter cols with the boolean mask returned by the get_support() method of SelectKBest:
from itertools import compress
list(compress(cols, skb.get_support()))
Full Reproducible Example
import random
import itertools
import pandas as pd
from itertools import compress
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
# First build some data with categorical and numerical features
data = load_iris()
X, y, feature_names = data['data'], data['target'], data['feature_names']
X = pd.DataFrame(X, columns=feature_names)
X['some_location'] = [random.choice(['NY', 'Texas', 'Boston']) for _ in range(X.shape[0])]
# Apply the column transformers
num_features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
cat_features = ['some_location']
ct = ColumnTransformer([
("scaling", MinMaxScaler(), num_features),
("onehot", OneHotEncoder(sparse=False, handle_unknown='ignore'), cat_features)],
remainder='passthrough') #pass through
X_train_trans = ct.fit_transform(X)
# Get the column names
cols = [(transformer[1].get_feature_names() if getattr(transformer[1], "get_feature_names", None) else transformer[2])
for transformer in ct.transformers_]
cols = list(itertools.chain(*cols))
cols
>>>
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)',
'x0_Boston',
'x0_NY',
'x0_Texas']
# Apply SelectKBest
skb = SelectKBest(chi2, k=4)
X_train_trans_select = skb.fit_transform(X_train_trans, y)
# Get selected columns
list(compress(cols, skb.get_support()))
>>>
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
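If you want to double-check that the names really line up with the selected columns, one option is to compare the selector's output against the input matrix indexed by the mask. This is only a sketch on random data; the feature names f0..f4 are hypothetical placeholders:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((30, 5))           # non-negative values, as chi2 requires
y = np.arange(30) % 2             # two balanced classes
names = [f"f{i}" for i in range(5)]  # hypothetical feature names

skb = SelectKBest(chi2, k=3)
X_sel = skb.fit_transform(X, y)

# Boolean mask over the input columns, in input order
mask = skb.get_support()

# The selected matrix is the input restricted to the masked columns
same_columns = np.array_equal(X_sel, X[:, mask])

# Names via the mask, and via integer indices as an alternative
selected_by_mask = [n for n, keep in zip(names, mask) if keep]
selected_by_index = [names[i] for i in skb.get_support(indices=True)]
print(same_columns, selected_by_mask)
```

Both ways of picking the names should agree, since get_support(indices=True) returns the positions of the True entries in the mask.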