Reputation: 73
Can I extract feature importances, together with the feature names, from a scikit-learn pipeline that I've built? The pipeline contains a GradientBoostingClassifier wrapped in a MultiOutputClassifier, and the pipeline is tuned with a GridSearchCV object.
This is my code:
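import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
# IterativeImputer is experimental and must be enabled before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import ShuffleSplit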
# generate x and y datasets
x_excluded = ['study_id', 'cv19_1_vaccommunity', 'cv19_1_vacprotectself', 'cv19_2_vaccommunity',
              'cv19_2_vacprotectself', 'subject_id']
X = ml_df[[c for c in ml_df.columns if c not in x_excluded]]
y_included = ['cv19_1_vaccommunity', 'cv19_1_vacprotectself', 'cv19_2_vaccommunity',
              'cv19_2_vacprotectself']
y = np.array(ml_df[[c for c in ml_df.columns if c in y_included]])
# instantiate gradient boosting pipeline
gb_pipeline = Pipeline([
    ('ss', StandardScaler()),
    ('imp', IterativeImputer(n_nearest_features=5, verbose=2)),
    ('clf', MultiOutputClassifier(GradientBoostingClassifier(verbose=1)))
])
# make a grid of the parameters we want to test
parameters = {
    'clf__estimator__learning_rate': [0.05, 0.1, 0.5],
    'clf__estimator__max_features': ['auto', 'log2'],
    'clf__estimator__warm_start': [False, True],
    'clf__estimator__loss': ['deviance', 'exponential']
}
# instantiate crossvalidator
cv = ShuffleSplit(n_splits=5, random_state=1, test_size=0.2)
# instantiate GridSearchCV object
search = GridSearchCV(estimator=gb_pipeline, param_grid=parameters, cv=cv)
search.fit(X, y)
When I try to extract feature importances as follows, I get an error:
feature_importances = search.best_estimator_._final_estimator.feature_importances_
output:
TypeError: 'MultiOutputClassifier' object is not subscriptable
I've also tried extracting feature importances directly from the pipeline object itself as follows:
gb_feat_impts = [clf.feature_importances_ for clf in gb_pipeline.named_steps['clf'].estimators_]
output:
AttributeError: 'MultiOutputClassifier' object has no attribute 'estimators_'
Does anybody have ideas on how to do this?
Upvotes: 2
Views: 686
Reputation: 73
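I just figured this out. The fitted per-output classifiers are stored in estimators_ on the MultiOutputClassifier inside search.best_estimator_. gb_pipeline itself is never fitted, because GridSearchCV fits clones of it, which is why it has no estimators_ attribute. The code below averages the feature importances across all outputs of the MultiOutputClassifier and pairs each mean with its column name.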
import numpy as np
# make a list of feature importances, one array per output
# (named_steps['clf'] is the public way to reach the final pipeline step)
gb_feat_impts = [clf.feature_importances_ for clf in search.best_estimator_.named_steps['clf'].estimators_]
# calculate the mean importance of each feature across all outputs of the MultiOutputClassifier
gb_impts_means = np.mean(gb_feat_impts, axis=0)
# zip column names together with mean importances, sorted from most to least important
gb_means_names = sorted(zip(X.columns, gb_impts_means), key=lambda pair: pair[1], reverse=True)
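If you want the importances per output rather than the mean, the same idea can be sketched like this. This is only a minimal, self-contained illustration on synthetic data: X_demo, y_demo, demo_search, and the names f0..f3 and y0, y1 are made up for the example and are not the survey columns from the question. The imputer step is left out and the grid is reduced to keep it short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical names, not the real survey columns)
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]
target_names = ["y0", "y1"]
X_demo = rng.normal(size=(200, 4))
y_demo = np.column_stack(
    [X_demo[:, 0] > 0, X_demo[:, 1] + X_demo[:, 2] > 0]
).astype(int)

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("clf", MultiOutputClassifier(
        GradientBoostingClassifier(n_estimators=20, random_state=0))),
])
demo_search = GridSearchCV(
    pipe,
    param_grid={"clf__estimator__learning_rate": [0.1, 0.5]},
    cv=ShuffleSplit(n_splits=3, test_size=0.2, random_state=1),
)
demo_search.fit(X_demo, y_demo)

# estimators_ holds one fitted classifier per target column, in column order
fitted_clf = demo_search.best_estimator_.named_steps["clf"]

# Map each target to its (feature name, importance) pairs, most important first
per_output = {
    target: sorted(
        zip(feature_names, est.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    for target, est in zip(target_names, fitted_clf.estimators_)
}

for target, ranked in per_output.items():
    print(target, [(name, round(float(imp), 3)) for name, imp in ranked])
```

With the real data you would presumably use X.columns for the feature names. For the target names, note that y in the question is built with a comprehension over ml_df.columns, so the outputs follow the column order of ml_df, not necessarily the order of y_included.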
Upvotes: 1