Reputation: 2495
I want to get feature names after I fit the pipeline.
categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
Then
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', GradientBoostingRegressor())])
After fitting with pandas dataframe, I can get feature importances from
clf.steps[1][1].feature_importances_
and I tried clf.steps[0][1].get_feature_names()
but I got an error
AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.
How can I get feature names from this?
Upvotes: 93
Views: 94328
Reputation: 91
### preprocess pipeline
cat_pipeline = Pipeline(steps=[("encode", OneHotEncoder(drop="first"))])
num_pipeline = Pipeline(steps=[("impute", SimpleImputer(strategy="median")),
("scaler", StandardScaler())])
### transformer
preprocess_transformer = ColumnTransformer([("cat_pipeline", cat_pipeline, cat_features),
("num_pipeline", num_pipeline, num_features)])
X_train_transformed = preprocess_transformer.fit_transform(X_train)
### get feature names out
preprocess_transformer.get_feature_names_out(preprocess_transformer.feature_names_in_)
Upvotes: 0
Reputation: 134
If you are looking for how to access column names after successive pipelines with the last one being ColumnTransformer
, you can access them by following this example here:
In the full_pipeline
there are two pipelines gender
and relevent_experience
full_pipeline = ColumnTransformer([
("gender", gender_encoder, ["gender"]),
("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])
The gender
pipeline looks like this:
gender_encoder = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
("cat", OneHotEncoder())
])
After fitting the full_pipeline
, you can access the column names using the following snippet
full_pipeline.transformers_[0][1][1].get_feature_names_out()
In my case the output was:
array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)
Upvotes: 5
Reputation: 846
You are very close to getting this right. After you build your pipeline:
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', DecisionTreeRegressor())])
Fit clf
onto the features
and the target
variable as follows:
clf.fit(features, target)
and you should be then able to access feature names for OneHotEncoder
:
clf.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names_out()
Upvotes: 1
Reputation: 16966
You can access the feature_names using the following snippet:
clf.named_steps['preprocessor'].transformers_[1][1]\
.named_steps['onehot'].get_feature_names(categorical_features)
Using sklearn >= 0.21 version, we can make it even simpler:
clf['preprocessor'].transformers_[1][1]\
['onehot'].get_feature_names(categorical_features)
Reproducible example:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
'category': ['asdf', 'asfa', 'asdfas', 'as'],
'num1': [1, 1, 0, 0],
'target': [0.2, 0.11, 1.34, 1.123]})
numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', LinearRegression())])
clf.fit(df.drop('target', 1), df['target'])
clf.named_steps['preprocessor'].transformers_[1][1]\
.named_steps['onehot'].get_feature_names(categorical_features)
# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
# 'category_asdf' 'category_asdfas' 'category_asfa']
Upvotes: 94
Reputation: 500
Scikit-Learn 1.0 now has new features to keep track of feature names.
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# SimpleImputer does not have get_feature_names_out, so we need to add it
# manually. This should be fixed in Scikit-Learn 1.0.1: all transformers will
# have this method.
# g
SimpleImputer.get_feature_names_out = (lambda self, names=None:
self.feature_names_in_)
num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
(num_pipeline, ["age", "height"]),
(OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())
df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
"age": [32, 65, 18, 24],
"height": [172, 163, 169, 190],
"weight": [65, 62, 54, 95]},
index=["Alice", "Bunji", "Cécile", "Dave"])
pipeline.fit(df, df["weight"])
## get pipeline feature names
pipeline[:-1].get_feature_names_out()
## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
columns=pipeline[:-1].get_feature_names_out(),
index=df.index)
Upvotes: 40
Reputation: 175
EDIT: actually Peter's comment answer is in the ColumnTransformer doc:
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
To complete Venkatachalam's answer with what Paul asked in his comment, the order of feature names as it appears in the ColumnTransformer .get_feature_names() method depends on the order of declaration of the steps variable at the ColumnTransformer instanciation.
I could not find any doc so I just played with the toy example below and that let me understand the logic.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler
class testEstimator(BaseEstimator,TransformerMixin):
def __init__(self,string):
self.string = string
def fit(self,X):
return self
def transform(self,X):
return np.full(X.shape, self.string).reshape(-1,1)
def get_feature_names(self):
return self.string
transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)
dt_test = np.zeros((1000,2))
pipeline.fit_transform(dt_test)
for name,step in pipeline.named_steps.items():
if hasattr(step, 'get_feature_names'):
print(step.get_feature_names())
For the sake of having a more representative example I added a RobustScaler and nested the ColumnTransformer on a Pipeline. By the way, you will find my version of Venkatachalam's way to get the feature name looping of the steps. You can turn it into a slightly more usable variable by unpacking the names with a list comprehension:
[i for i in v.get_feature_names() for k, v in pipeline.named_steps.items() if hasattr(v,'get_feature_names')]
So play around with the dt_test and the estimators to soo how the feature name is built, and how it is concatenated in the get_feature_names(). Here is another example with a transformer which output 2 columns, using the input column:
class testEstimator3(BaseEstimator,TransformerMixin):
def __init__(self,string):
self.string = string
def fit(self,X):
self.unique = np.unique(X)[0]
return self
def transform(self,X):
return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)
def get_feature_names(self):
return list((self.unique,self.string))
dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)
transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)
pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
if hasattr(step[1], 'get_feature_names'):
print(step[1].get_feature_names())
Upvotes: 10