Reputation: 4914
In Friedman’s “Greedy Function Approximation” (Annals of Statistics, 2001), the relative importance of input variables is described in Section 8.1. Equation 44 (adapted from Breiman, Friedman, Olshen & Stone, 1983) defines a feature’s relative importance in a single tree as the total (i.e. the sum of the) improvement in squared error over all nodes that split on that feature, not a normalized or proportional quantity. Equation 45 then gives the feature’s relative importance for the GBM by averaging that per-tree sum over all trees (again a sum within each tree, not an average over proportions).
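For reference, my reading of those two equations (with $\hat{i}_t^2$ the empirical improvement in squared error from the split at internal node $t$, $v_t$ the splitting variable at that node, $J$ the number of terminal nodes of tree $T$, and $M$ the number of trees) is:
$$\hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \,\mathbf{1}(v_t = j) \tag{44}$$
$$\hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m) \tag{45}$$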
This sum is found in the code here.
I am pretty sure that a feature which is rarely used, but is important when it is used, would not rank highly under this method. The current definition is something like the total utility, but I think I want the average, which would remove the dependence on how many times the feature was used. For example, imagine a binary feature which is nonzero in only one in a million rows, but has a huge effect on the prediction whenever it is. Changing the sum in the above line of code to an average would highlight such features; a rough sketch of what I mean is below.
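To make the idea concrete, here is a minimal sketch against a fitted scikit-learn GradientBoostingRegressor clf. The per-node gains here are the (sample-weighted) impurity improvements that scikit-learn accumulates rather than exactly equation 44, and average_split_importance is just a name I made up for illustration:

import numpy as np

def average_split_importance(clf, n_features):
    # total impurity improvement and number of splits per feature, over all trees
    total_gain = np.zeros(n_features)
    split_counts = np.zeros(n_features)
    for stage in clf.estimators_:   # one row of trees per boosting iteration
        for tree in stage:
            t = tree.tree_
            # unnormalized per-feature impurity improvement for this tree
            total_gain += t.compute_feature_importances(normalize=False)
            # count internal nodes that split on each feature (leaves have feature < 0)
            split_features = t.feature[t.feature >= 0]
            split_counts += np.bincount(split_features, minlength=n_features)
    # average improvement per split; features never split on get zero
    return np.divide(total_gain, split_counts,
                     out=np.zeros(n_features), where=split_counts > 0)

A feature that is split on only a handful of times, but with a large gain each time, would score highly here, whereas the stock sum-based definition dilutes it.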
Is this something that is done? Is the effect I am worried about already accounted for, since the feature importance at a node is weighted by the number of samples at that node? Is there a better way to deal with sparseness and feature importance?
The purpose of thinking about feature importance in this way is to make sure one does not eliminate features which are unimportant in general but crucial in a few rare outlier cases. When doing feature selection, it is easy to justify dropping such features when looking only at aggregate metrics.
Upvotes: 1
Views: 253
Reputation: 4914
As explained here, the feature importance defined through the tree is not a great metric. If you can afford the compute time, you are better off using permutation feature importance.
ELI5 has an implementation of this. For comparison, you can run the following code against your trained model clf (it also assumes X_train, y_train, X_test, y_test and a list of feature names features already exist).
import numpy as np
import matplotlib.pyplot as plt
from eli5.sklearn import PermutationImportance

# number of times each feature is shuffled when estimating its importance
iterations = 5

# scoring options: http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
eval_metric = 'r2'
#eval_metric = 'neg_mean_absolute_error'
#eval_metric = 'neg_mean_squared_error'
#eval_metric = 'explained_variance'

# permutation importance on the training set
perm_train = PermutationImportance(clf, scoring=eval_metric, n_iter=iterations).fit(X_train, y_train)
feature_importance_train = perm_train.feature_importances_
feature_importance_train_error = perm_train.feature_importances_std_ / np.sqrt(iterations)

# permutation importance on the test set
perm_test = PermutationImportance(clf, scoring=eval_metric, n_iter=iterations).fit(X_test, y_test)
feature_importance_test = perm_test.feature_importances_
feature_importance_test_error = perm_test.feature_importances_std_ / np.sqrt(iterations)

# rescale the model's built-in importances relative to the largest train importance
feature_importance_model = clf.feature_importances_
feature_importance_model = feature_importance_train.max() * (feature_importance_model / feature_importance_model.max())

# plot all three importances on one axis, sorted by the model importance
sorted_idx = np.argsort(feature_importance_model)
pos = np.arange(sorted_idx.shape[0]) + .5

featfig = plt.figure(figsize=(6, 15))
featfig.suptitle('Feature Importance')
featax = featfig.add_subplot(1, 1, 1)
featax.errorbar(x=feature_importance_train[sorted_idx], y=pos, xerr=feature_importance_train_error[sorted_idx], linestyle='none', marker='.', label='Train')
featax.errorbar(x=feature_importance_test[sorted_idx], y=pos, xerr=feature_importance_test_error[sorted_idx], linestyle='none', marker='.', label='Test')
featax.errorbar(x=feature_importance_model[sorted_idx], y=pos, linestyle='none', marker='.', label='Model')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(features)[sorted_idx], fontsize=8)
featax.set_xlabel(eval_metric + ' change')
featlgd = featax.legend(loc=0)
Since you can choose your evaluation metric, you can pick one that is more or less sensitive to outliers (for example, mean squared error penalizes large outlier errors much more heavily than mean absolute error).
Upvotes: 1