Adrian

Reputation: 373

Get label specific feature importances from Random Forest (and Decision Tree)

I want to retrieve label- / class-specific feature importances from a Random Forest or a Decision Tree without training a one-vs-rest model n_class times. I am using scikit-learn in Python; the models are instances of either the tree.DecisionTreeClassifier() or RandomForestClassifier() class. Since the feature_importances_ attribute only returns the importance of each feature across the whole model, it is unfortunately not quite helpful for me!

Upvotes: 2

Views: 1641

Answers (2)

CutePoison

Reputation: 5355

The built-in feature importance should be used carefully! The importance can be calculated in many different ways: how often is a variable split on? What is the average impurity decrease after a split on each variable? Etc. It is therefore very important that you know exactly how the importance is calculated, and that you agree that it indeed corresponds to your understanding of "importance".

I would recommend looking at shap for calculating SHAP values, which give a more robust and "correct" picture of the importances. For a multiclass tree model, SHAP attributions also come out per class, which is what the question asks for.

Upvotes: 0

Rinshan Kolayil

Reputation: 1139

To get labelled importances, you can create a pandas.Series and assign the column names of the training data as its index. The feature_importances_ array returned by RandomForestClassifier follows the column order of the training data, so X.columns lines up with it directly.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X, y)
# feature_importances_ follows the column order of the training data,
# so X.columns can be used directly as the Series index
importances = pd.Series(rfc.feature_importances_, index=X.columns)

print(importances)

Pclass      0.083675
Sex         0.190060
Age         0.234741
SibSp       0.051893
Parch       0.034452
Fare        0.254560
Embarked    0.031173
titles      0.119446
dtype: float64
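To read off the most important features at a glance, the Series can simply be sorted. A self-contained sketch (the breast-cancer dataset here stands in for the answer's Titanic data, which is not included in the post):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# as_frame=True gives a DataFrame, so X.columns is available
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rfc.feature_importances_, index=X.columns)

# Largest importance first
print(importances.sort_values(ascending=False).head())
```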

print(X)

     Pclass  Sex        Age  SibSp  Parch     Fare  Embarked  titles
0         3    0  22.000000      1      0   7.2500         0      12
1         1    1  38.000000      1      0  71.2833         1      13
2         3    1  26.000000      0      0   7.9250         0       9
3         1    1  35.000000      1      0  53.1000         0      13
4         3    0  35.000000      0      0   8.0500         0      12
..      ...  ...        ...    ...    ...      ...       ...     ...
886       2    0  27.000000      0      0  13.0000         0      15
887       1    1  19.000000      0      0  30.0000         0       9
888       3    1  29.699118      1      2  23.4500         0       9
889       1    0  26.000000      0      0  30.0000         1      12
890       3    0  32.000000      0      0   7.7500         2      12

print(X.columns)
>>> Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'titles'], dtype='object')

Please refer to the scikit-learn example Feature importances with a forest of trees for more details.

Upvotes: 0
