makemyDNA

Reputation: 73

Extracting feature importances from an sklearn pipeline containing a MultiOutputClassifier within GridSearchCV?

I'm wondering whether I can extract feature importances, along with the feature names, from a scikit-learn pipeline that I've built. The pipeline contains a GradientBoostingClassifier wrapped in a MultiOutputClassifier, and the pipeline is the estimator of a GridSearchCV object.

This is my code:

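import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.impute import IterativeImputer
from sklearn.model_selection import ShuffleSplit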

# generate x and y datasets
x_excluded = ['study_id', 'cv19_1_vaccommunity', 'cv19_1_vacprotectself', 'cv19_2_vaccommunity', 
'cv19_2_vacprotectself', 'subject_id']
X = ml_df[[c for c in ml_df.columns if c not in x_excluded]]

y_included = ['cv19_1_vaccommunity', 'cv19_1_vacprotectself', 'cv19_2_vaccommunity', 
'cv19_2_vacprotectself']
y = np.array(ml_df[[c for c in ml_df.columns if c in y_included]])

# instantiate gradient boosting pipeline
gb_pipeline = Pipeline([
    ('ss', StandardScaler()),
    ('imp', IterativeImputer(n_nearest_features=5, verbose=2)),
    ('clf', MultiOutputClassifier(GradientBoostingClassifier(verbose=1)))
])

# make a grid of the parameters we want to test
parameters = {
    'clf__estimator__learning_rate': [0.05, 0.1, 0.5],
    'clf__estimator__max_features': ['auto', 'log2'],
    'clf__estimator__warm_start': [False, True], 
    'clf__estimator__loss': ['deviance', 'exponential']
}

# instantiate crossvalidator
cv = ShuffleSplit(n_splits=5, random_state=1, test_size=0.2)

# instantiate GridSearchCV object
search = GridSearchCV(estimator=gb_pipeline, param_grid=parameters, cv=cv)

search.fit(X, y)

When I try to extract feature importances as follows, I get an error:

feature_importances = search.best_estimator_._final_estimator.feature_importances_

output:
TypeError: 'MultiOutputClassifier' object is not subscriptable

I've also tried extracting feature importances directly from the pipeline object itself as follows:

gb_feat_impts = [clf.feature_importances_ for clf in gb_pipeline.named_steps['clf'].estimators_]

output:
AttributeError: 'MultiOutputClassifier' object has no attribute 'estimators_'

Does anybody have ideas on how to do this?

Upvotes: 2

Views: 686

Answers (1)

makemyDNA

Reputation: 73

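I just figured this out. GridSearchCV fits clones of the pipeline, so gb_pipeline itself is never fitted, which is why its MultiOutputClassifier has no estimators_ attribute. The refitted pipeline lives in search.best_estimator_, and its final step holds one fitted GradientBoostingClassifier per output in estimators_. The code below collects the feature importances from each of those estimators and averages them across all outputs of the MultiOutputClassifier.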

import numpy as np

# make list of feature importances for each output
# (named_steps['clf'] is the public way to reach the final step of the fitted pipeline)
fitted_clf = search.best_estimator_.named_steps['clf']
gb_feat_impts = [clf.feature_importances_ for clf in fitted_clf.estimators_]

# calculate mean feature importance from all outputs of multioutputclassifier
gb_impts_means = np.mean(gb_feat_impts, axis=0)

# zip together column names with feature importances, sort by importance
gb_means_names = sorted(zip(X.columns, gb_impts_means), key=lambda pair: pair[1], reverse=True)
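
For anyone who wants to try this without my data, here is a minimal, self-contained sketch of the same idea on synthetic data. The feature names (feat_0 ... feat_5), the target labels (target_a, target_b, target_c), and the variable names are made up for illustration. I left out the IterativeImputer step because the synthetic data has no missing values, and I kept the grid tiny so it runs quickly. It also keeps the per-output importances in a DataFrame, which is handy if averaging across outputs hides differences between targets.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for ml_df: 6 hypothetical features, 3 binary targets
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(120, 6)),
    columns=[f"feat_{i}" for i in range(6)],
)
y_demo = np.column_stack([
    (X_demo["feat_0"] > 0).to_numpy().astype(int),
    (X_demo["feat_1"] + X_demo["feat_2"] > 0).to_numpy().astype(int),
    (X_demo["feat_3"] > 0).to_numpy().astype(int),
])
target_names = ["target_a", "target_b", "target_c"]  # hypothetical labels

demo_pipeline = Pipeline([
    ("ss", StandardScaler()),
    ("clf", MultiOutputClassifier(
        GradientBoostingClassifier(n_estimators=20, random_state=0)
    )),
])

search_demo = GridSearchCV(
    estimator=demo_pipeline,
    param_grid={"clf__estimator__learning_rate": [0.05, 0.1]},
    cv=ShuffleSplit(n_splits=2, test_size=0.2, random_state=1),
)
search_demo.fit(X_demo, y_demo)

# The fitted MultiOutputClassifier is the 'clf' step of best_estimator_;
# demo_pipeline itself stays unfitted because GridSearchCV works on clones.
fitted_clf = search_demo.best_estimator_.named_steps["clf"]

# One row per output, one column per feature
per_output = pd.DataFrame(
    [est.feature_importances_ for est in fitted_clf.estimators_],
    index=target_names,
    columns=X_demo.columns,
)

# Mean importance across outputs, sorted from most to least important
mean_importances = per_output.mean(axis=0).sort_values(ascending=False)

print(per_output.round(3))
print(mean_importances.round(3))
```

This only works as written because no step changes the number or order of columns. If the pipeline included a step that drops or adds features, X.columns would no longer line up with the importances.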

Upvotes: 1
