Reputation: 373
I want to retrieve label-specific (class-specific) feature importances from a random forest or a decision tree without training a one-vs-rest model n_class times.
I am using scikit-learn in Python. The models are instances of either tree.DecisionTreeClassifier() or RandomForestClassifier().
Since the feature_importances_ attribute only returns the importance of each feature across the whole model, it unfortunately does not help me here.
Upvotes: 2
Views: 1641
Reputation: 5355
The built-in "importance" should be used carefully! Importance can be calculated in many different ways: how often is a variable split on? What is the average impurity after a split on each variable? And so on. It is therefore very important that you know exactly how the importance is calculated, and that you agree it really corresponds to your own understanding of "importance".
I would recommend looking at shap and calculating SHAP values, which give a more robust and "correct" answer for the importances.
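To illustrate how much the definition matters, here is a minimal sketch of one alternative, not the method scikit-learn uses for feature_importances_. It computes permutation importance separately for each class on a single fitted model, so no one-vs-rest retraining is needed. The iris dataset, the choice of per-class recall as the score, and the names `global_importances` and `per_class_importances` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import train_test_split

# Illustrative data only: any multiclass dataset would do.
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, stratify=data.target, random_state=0
)

# One model, trained once on all classes.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Global, impurity-based importances (one value per feature).
global_importances = model.feature_importances_

# Per-class importances: how much does shuffling a feature hurt the
# recall of class k? (Assumption: recall is the per-class score of interest.)
per_class_importances = {}
for k in model.classes_:
    scorer = make_scorer(recall_score, labels=[k], average="macro")
    result = permutation_importance(
        model, X_test, y_test, scoring=scorer, n_repeats=10, random_state=0
    )
    per_class_importances[data.target_names[k]] = result.importances_mean

print("global:", np.round(global_importances, 3))
for name, values in per_class_importances.items():
    print(name, np.round(values, 3))
```

Recall only captures missed samples of a class; if false positives matter as well, a per-class F1 score could be swapped in the same way. The per-class values answer a different question than the impurity-based ones, so the two rankings need not agree.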
Upvotes: 0
Reputation: 1139
To get the labels, you can create a pandas.Series and use the column names of the training data as its index. The feature importances returned by RandomForestClassifier keep the order of the training data columns.
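import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X (a DataFrame, printed below) holds the features; y holds the target
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X, y)
# Use X.columns as the index so each importance is labeled with its feature name
importances = pd.Series(rfc.feature_importances_, index=X.columns)
print(importances)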
Pclass 0.083675
Sex 0.190060
Age 0.234741
SibSp 0.051893
Parch 0.034452
Fare 0.254560
Embarked 0.031173
titles 0.119446
dtype: float64
print(X)
Pclass Sex Age SibSp Parch Fare Embarked titles
0 3 0 22.000000 1 0 7.2500 0 12
1 1 1 38.000000 1 0 71.2833 1 13
2 3 1 26.000000 0 0 7.9250 0 9
3 1 1 35.000000 1 0 53.1000 0 13
4 3 0 35.000000 0 0 8.0500 0 12
... ... ... ... ... ... ... ... ...
886 2 0 27.000000 0 0 13.0000 0 15
887 1 1 19.000000 0 0 30.0000 0 9
888 3 1 29.699118 1 2 23.4500 0 9
889 1 0 26.000000 0 0 30.0000 1 12
890 3 0 32.000000 0 0 7.7500 2 12
print(X.columns)
Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'titles'], dtype='object')
Please refer to the scikit-learn example "Feature importances with a forest of trees" for more details.
Upvotes: 0