Kanmani

Reputation: 479

Random Forest sklearn Variable Importance

Can the variable importance values given by the feature_importances_ attribute of sklearn's RandomForestClassifier be interpreted as percentages? I understand that it's the average reduction in the impurity index over all trees when a particular feature is used at a split point. What is the range of feature_importances_ values? For a dataset with 1000 features, if the feature_importances_ values range from 0 to 0.05, with most of the features at 0 and only a few showing a slight increase, does this show that the data is noisy?

Upvotes: 1

Views: 7229

Answers (2)

MB-F

Reputation: 23637

I understand that it's the average reduction in the impurity index over all trees when a particular feature is used at a split point.

This is correct. Let's look in detail at how the feature importance for a single tree is computed [1]. The impurity reduction of a split is the impurity of the node before the split minus the sum of both child nodes' impurities after the split, each weighted by the number of samples reaching the node. These reductions are summed over all splits in the tree, per feature. Then the importances are normalized: each feature's importance is divided by the total sum of importances.
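As a sanity check, here is a minimal sketch (using load_iris as a stand-in dataset) that recomputes this quantity from a fitted tree's tree_ internals; it should reproduce the tree's feature_importances_:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, no impurity reduction
        continue
    # impurity before the split minus the children's impurities,
    # each weighted by the number of samples reaching the node
    reduction = (t.weighted_n_node_samples[node] * t.impurity[node]
                 - t.weighted_n_node_samples[left] * t.impurity[left]
                 - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += reduction

importances /= importances.sum()  # normalize so the importances sum to one
print(np.allclose(importances, clf.feature_importances_))  # True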

So, in some sense the feature importances of a single tree are percentages. They sum to one and describe how much a single feature contributes to the tree's total impurity reduction.

The feature importances of a Random Forest are computed as the average of the importances over all trees. They can still be seen as each feature's fractional contribution to the total impurity reduction. You can (and should) verify that they sum to one.
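A quick check of both claims, again using load_iris as a stand-in dataset (n_estimators=100 is just an illustrative choice):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.feature_importances_.sum())  # ~1.0, up to floating-point error

# the forest's importances are the per-tree importances averaged over the trees
mean_over_trees = np.mean([est.feature_importances_ for est in rf.estimators_], axis=0)
print(np.allclose(mean_over_trees, rf.feature_importances_))  # True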

What is the range of feature_importances_ values?

In theory, the range is 0 to 1, due to the way the importances are normalized. In practice, however, the range will be considerably narrower. Random Forests randomly pick features and subsets of the data, so there is a good chance that every feature is used in some split somewhere. Even if such features are not very important, they still take a small part of the total importance. Since importances are supposed to sum to one, the more features you have, the lower the individual importances will generally be.
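To see this effect, here is a hedged illustration on synthetic data with many features (make_classification with purely illustrative parameters); the exact numbers will vary, but the maximum importance stays far below 1:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1000 features, only a few of which are actually informative
X, y = make_classification(n_samples=500, n_features=1000, n_informative=10,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

imp = rf.feature_importances_
print(imp.sum())         # ~1.0
print(imp.max())         # far below 1: the total is spread over many features
print(np.sum(imp == 0))  # some features may never be used in any split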

feature_importances_ values range from 0 to 0.05, with most of the features at 0 and only a few showing a slight increase, does this show that the data is noisy?

No, this most likely means that you have few samples and/or few trees (estimators). An importance of 0 most likely means that the feature was not used at all in the forest. Due to the random nature of this classifier, all features should be used at least a little, so many zeros indicate that not many splits were performed.

I assume you don't have many trees, because the default of n_estimators=10 is very low. Most literature suggests using hundreds or thousands of trees. You can hardly have too many; extra trees only cost computation time.
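A small experiment along those lines: with more trees (and therefore more splits), fewer features end up with an importance of exactly 0. The dataset and parameters below are illustrative stand-ins:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=1000, n_informative=10,
                           random_state=0)

for n_trees in (10, 100, 1000):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    n_unused = np.sum(rf.feature_importances_ == 0)
    print(n_trees, "trees ->", n_unused, "features with importance 0")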

Finally, a word of warning

Do not rely too much on feature importances. If one feature has a higher importance than another, it probably means that it really is more important, but you can't be sure:

For example, assume you copied one feature and added it to the data as a new feature. The original feature and the copy should be equally important. However, whenever a split is performed, only one of them can be chosen, so the original feature will be picked, at random, about half of the time. This halves the original feature's importance: both the copy and the original end up with only about half the importance they would have had if the other were not present.
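This effect is easy to reproduce; a minimal sketch, once more using load_iris as a stand-in and appending a copy of the first feature:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print("original importance of feature 0:", rf.feature_importances_[0])

# append an exact copy of feature 0 as a new column
X_dup = np.hstack([X, X[:, :1]])
rf_dup = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_dup, y)

# the original and the copy now roughly share what used to be feature 0's importance
print("after duplication:", rf_dup.feature_importances_[0],
      "+", rf_dup.feature_importances_[-1])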

If some features in the data are correlated (more precisely, statistically dependent), they will each get a lower importance than an equally important but uncorrelated feature.

Upvotes: 4

AndreyF

Reputation: 1838

does this show that the data is noisy?

Your question is very general, and it is not really possible to answer it without looking at the data.

However, an easy way to express the feature_importances_ values as percentages is to normalize them:

# normalize the importances and convert to percentages
importance_sum = sum(rf.feature_importances_)
feature_importance_as_percent = [100 * (x / importance_sum)
                                 for x in rf.feature_importances_]
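(As noted in the other answer, scikit-learn's importances are already normalized to sum to one, so importance_sum should come out very close to 1.0 and the percentages are essentially just 100 * rf.feature_importances_.)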

Upvotes: 0
