Vince Demortier

Reputation: 75

How to extract the feature importances from the Logitboost algorithm in a multi-class classification setting?

I am currently running a multi-class Logitboost algorithm (docs), which works great. However, when trying to view the importances of different features I get this error message:

NotImplementedError: Feature importances is currently only implemented for binary classification tasks.
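For reference, the setup that produces this is roughly the following sketch (my actual data differs; I'm assuming here that the error comes from reading the feature_importances_ attribute):

import logitboost
from sklearn.datasets import load_iris

# any dataset with more than two classes triggers the error
X, y = load_iris(return_X_y=True)
lg = logitboost.LogitBoost()
lg.fit(X, y)
lg.feature_importances_  # raises NotImplementedError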

When looking at the GitHub code, I don't really understand why this hasn't been implemented yet. Does anybody know a way around this so that I can plot the feature importances, or is there nothing I can do but wait for a newer version of LogitBoost (which doesn't seem that likely, seeing as the last update was several months ago)?

I have already tried to modify the logitboost.py file myself, but seeing as I have limited programming knowledge, this is a rather tedious process.

Thanks in advance!

Upvotes: 1

Views: 365

Answers (1)

yatu

Reputation: 88226

Looking a bit into the source code, you can see that the base_estimator defaults to a decision stump (a depth-1 DecisionTreeRegressor):

# The default regressor for LogitBoost is a decision stump
_BASE_ESTIMATOR_DEFAULT = DecisionTreeRegressor(max_depth=1)

We know a decision tree does expose feature importances, though apparently this module does not yet implement the method for multiclass problems. By looking into the structure of the fitted classifier, though, it seems fairly simple to come up with a custom importance metric.

Let's see with an example, using the iris dataset:

import logitboost
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
lg = logitboost.LogitBoost()
lg.fit(X_train, y_train)

If you look at lg.estimators_, you'll see that the structure is a nested list of fitted decision trees.
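A quick way to see that structure (a sketch; if I read the source right, the outer list has one entry per boosting iteration and each inner list one fitted stump per class):

print(len(lg.estimators_))     # number of boosting iterations (50 by default)
print(len(lg.estimators_[0]))  # number of classes (3 for iris)
print(lg.estimators_[0][0])    # DecisionTreeRegressor(max_depth=1)

We could then do something like the following to get the overall importance: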

l_feat_imp = [sum(cls.feature_importances_ for cls in cls_list) 
              for cls_list in lg.estimators_]
imp = np.array(l_feat_imp).sum(0)
# array([ 9., 19., 51., 71.])

That is, this just takes the sum of the contributions of each feature for all inner lists of estimators, and then again over those summed contributions. So in this case we'd have:

pd.Series(imp, index=load_iris().feature_names).sort_values(ascending=False).plot.bar()

[Bar plot of the summed feature importances, sorted in descending order]
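Note that imp is a raw, unnormalized sum. If you prefer importances that sum to 1, as sklearn's usual feature_importances_ does, you can simply rescale:

imp_normalized = imp / imp.sum()
# array([0.06      , 0.12666667, 0.34      , 0.47333333])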

Upvotes: 1
