Can I show feature importance for MultiOutputClassifier?

Question

I'm trying to recover the feature importance of a multioutput Classifier using a RandomForest.

The MultiOutput model does not show any problems:

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

## Generate data
x, y = make_multilabel_classification(n_samples=1000,
                                      n_features=15,
                                      n_labels=5,
                                      n_classes=3,
                                      random_state=12,
                                      allow_unlabeled=True)

## Split: first 700 rows for training, remaining 300 for testing
## (slicing from 700, not 701, so that row 700 is not dropped)
x_train = x[:700, :]
x_test  = x[700:, :]
y_train = y[:700, :]
y_test  = y[700:, :]

## Generate model: one random forest is fitted per output column
forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_forest = MultiOutputClassifier(forest, n_jobs=-1).fit(x_train, y_train)

## Make prediction (a list with one probability array per output)
dfOutput_multi_forest = multi_forest.predict_proba(x_test)

The prediction dfOutput_multi_forest does not show any problems, but I want to recover the feature importance of the multi_forest for interpretation of the output.

Using multi_forest.feature_importance_ throws the following error message: AttributeError: 'MultiOutputClassifier' object has no attribute 'feature_importance_'

Does anyone know how to retrieve the feature importance? I'm using scikit-learn v0.20.2.

James Dellinger · Accepted Answer

Indeed, scikit-learn's MultiOutputClassifier does not appear to have an attribute that combines the feature importances of all the estimators used in the model (in your case, all the RandomForest classifiers).

However, it is possible to access the feature importances of each RandomForest classifier, and then average them all together to give you each feature's average importance, across all RandomForest classifiers.

MultiOutputClassifier objects have an attribute called estimators_. If you run multi_forest.estimators_, you will get a list containing an object for each of your RandomForest classifiers.

For each of these RandomForest classifier objects, you can access its feature importances through the feature_importances_ attribute.
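As a quick check of the two points above, here is a minimal sketch on a small synthetic dataset. The names x_demo, y_demo and demo_model, and the reduced sample and tree counts, are assumptions made for illustration and are not part of the question's code. Note that the attribute on each forest is spelled feature_importances_ (plural).

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Small synthetic dataset (sizes are illustrative assumptions).
x_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=15, n_classes=3, random_state=12
)
demo_model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=20, random_state=1)
).fit(x_demo, y_demo)

# The wrapper itself exposes no importance attribute.
print(hasattr(demo_model, "feature_importances_"))

# It does keep one fitted forest per output column.
print(len(demo_model.estimators_), y_demo.shape[1])

# Each fitted forest carries one importance value per input feature.
first_forest = demo_model.estimators_[0]
print(first_forest.feature_importances_.shape)
```
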

To put it all together, this was my approach:

feat_impts = [] 
for clf in multi_forest.estimators_:
    feat_impts.append(clf.feature_importances_)

np.mean(feat_impts, axis=0)

I ran the example code you pasted into your question, and then ran the above block of code, which output the following array of 15 averages:

array([0.09830467, 0.0912088 , 0.05738045, 0.1211305 , 0.03901933,
       0.05429491, 0.06929378, 0.06404416, 0.05676634, 0.04919717,
       0.05244265, 0.0509295 , 0.05615341, 0.09202444, 0.04780991])

This array contains the average importance of each of your 15 features, across the 3 random forest classifiers used in your MultiOutputClassifier.

This should at least help you to see which features, on the whole, tended to be more important in making predictions for each of your 3 classes.
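Averaging hides the fact that a feature may matter for one label and not for another. The sketch below is one possible extension rather than part of the approach above: it keeps the per-label importances as rows of a matrix, then ranks features by their mean. The names per_label, mean_importance and ranking are hypothetical, and the data is regenerated with the question's parameters so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Same synthetic data and model settings as in the question.
x, y = make_multilabel_classification(n_samples=1000, n_features=15,
                                      n_labels=5, n_classes=3,
                                      random_state=12, allow_unlabeled=True)
forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_forest = MultiOutputClassifier(forest).fit(x[:700], y[:700])

# One row per output label, one column per input feature.
per_label = np.vstack([clf.feature_importances_
                       for clf in multi_forest.estimators_])

# Column-wise mean: the same quantity as np.mean(feat_impts, axis=0).
mean_importance = per_label.mean(axis=0)

# Feature indices sorted from most to least important on average.
ranking = np.argsort(mean_importance)[::-1]

for idx in ranking[:5]:
    per_label_str = ", ".join(f"{v:.3f}" for v in per_label[:, idx])
    print(f"feature {idx}: mean={mean_importance[idx]:.3f} "
          f"(per label: {per_label_str})")
```

Each row of per_label can be inspected on its own when the question is which features drive one particular label.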
