Reputation: 33
After fitting, feature_importances_ (specifically of GradientBoostingClassifier, but it may exist for other methods as well) holds the feature importances. According to the documentation, the higher the value, the more important the feature.
Do you know what the returned numbers mean?
I get values ranging from 0.02 down to 10^-6 or 0.
If a feature has an importance of 0.02, then its importance is 2% relative to all features, but how does this relate to prediction accuracy or prediction correlation? Can I interpret this number to understand how removing that feature would affect the predictions?
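For reference, a minimal sketch of how I obtain these values (the synthetic dataset here is just a stand-in for my real data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic stand-in data; the real dataset is irrelevant to the question
    X, y = make_classification(n_samples=1000, n_features=10,
                               n_informative=3, random_state=0)

    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X, y)

    # One value per feature; the values are non-negative and sum to 1.0
    print(clf.feature_importances_)
    print(clf.feature_importances_.sum())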
Upvotes: 3
Views: 2811
Reputation: 56
Gilles Louppe, primary author of the sklearn ensemble and tree modules, wrote a great response to the question here.
There are different ways of quantifying how well a node in a decision tree partitions the incoming dataset into chunks whose output classes are purer than before the split. One such measure is gini importance: the decrease in output class impurity that the dataset split at the node provides. This measure, weighted by the number of rows actually split at that node using the feature and averaged over all the decision trees in the ensemble, determines feature_importances_ in sklearn.
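As a rough illustration, the gini importance can be recomputed by hand from a single fitted tree via the tree_ attribute (a sketch only; for an ensemble, the same per-tree quantity is additionally averaged over all the trees):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=5,
                               n_informative=2, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    t = tree.tree_
    importances = np.zeros(X.shape[1])
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf: no split, so no impurity decrease
            continue
        n = t.weighted_n_node_samples
        # Impurity decrease at this node, weighted by how many rows reach it
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right])
        importances[t.feature[node]] += decrease

    importances /= importances.sum()  # normalize so importances sum to 1
    print(np.allclose(importances, tree.feature_importances_))  # True

So an importance of 0.02 means the feature accounts for 2% of the total impurity reduction achieved by all splits in the model; it does not directly translate into the accuracy you would lose by removing the feature, since correlated features can substitute for one another. The only reliable way to measure that is to refit the model without the feature.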
Upvotes: 4