user11086563

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on DataFrame columns).

I have trained an XGBoost model on preprocessed data (centered and scaled using MinMaxScaler), and am therefore in exactly this situation: the feature names are lost.

For instance:

    from sklearn.preprocessing import MinMaxScaler
    from xgboost import XGBClassifier

    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)
    my_model_name = XGBClassifier()
    my_model_name.fit(X, Y)

where X and Y are the training data and labels respectively. The scaler returns a 2D NumPy array, discarding the feature names of the original pandas DataFrame.

Thus, when I call plot_importance(my_model_name), I get the feature importance plot, but with generic feature names such as f0, f1, f2, etc., instead of the actual feature names from the original data set. Is there a way to map the original feature names onto the generated feature importance plot, so that the original names appear in the graph? Any help in this regard is highly appreciated.

Upvotes: 24

Views: 47036

Answers (4)

rarry

Reputation: 3573

I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.

import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

Y = label  # your target vector, defined elsewhere

X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)

# Wrap the scaled array back into a DataFrame with the original column names
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)

my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df, Y)

xgb.plot_importance(my_model_name)
plt.show()

This will show the original names.

Upvotes: 0

Lara Wehbe

Reputation: 33

I tried the above answers, and they didn't work when loading the model after training. The code that works for me is:

model.feature_names

It returns a list of the feature names.
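For context, a minimal sketch of the load-then-inspect flow this answer describes; the file name is hypothetical, and the names only come back if the model was trained with them (e.g. on a pandas DataFrame):

import xgboost as xgb

# Load a previously saved model; "model.json" is a hypothetical path.
booster = xgb.Booster()
booster.load_model("model.json")

# Returns the stored feature names, or None if none were saved.
print(booster.feature_names)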

Upvotes: 0

Nerxis

Reputation: 3917

You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In that case calling model.get_booster().feature_names is not useful, because the returned names have the form [f0, f1, ..., fn], and these are the names shown in the output of the plot_importance method as well.

But there are several ways to achieve what you want, provided you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = list(X.columns) if X was a pandas DataFrame.

Then you should be able to:

  • change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which will pick up the updated names and show them on the plot (see the sketch after this list)
  • or, since this method returns a matplotlib Axes, modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to get the order of your features right)
  • or take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself; also sketched below)
  • similarly, use the model.get_booster().get_score() method and combine its output with your feature names
  • or try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) via train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training, since I usually use the Scikit-Learn API)
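A minimal, self-contained sketch of the first, third, and last options above. The data, target, and feature names here are synthetic placeholders, not from the question:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
orig_feature_names = ["age", "income", "score"]  # hypothetical names
X = rng.random((100, 3))
Y = rng.integers(0, 2, size=100)

# Scaling returns a bare NumPy array, so the names are gone from here on.
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

model = XGBClassifier(max_depth=2, n_estimators=10)
model.fit(X_scaled, Y)

# Option 1: overwrite the stored names, then plot as usual.
model.get_booster().feature_names = orig_feature_names
xgb.plot_importance(model.get_booster())
plt.show()

# Option 3: combine feature_importances_ with the original names yourself.
importances = pd.Series(model.feature_importances_, index=orig_feature_names)
importances.sort_values().plot.barh()
plt.show()

# Last option: Learning API with a DMatrix that carries the names directly.
dtrain = xgb.DMatrix(X_scaled, label=Y, feature_names=orig_feature_names)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                    dtrain, num_boost_round=10)
xgb.plot_importance(booster)
plt.show()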

EDIT:

Thanks to @Noob Programmer (see comments below), there might be some "inconsistencies" depending on which feature importance method you use. These are the most important ones:

  • xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
  • model.get_booster().get_score() also uses "weight" as the default (see get_score)
  • model.feature_importances_ depends on the importance_type parameter (model.importance_type), and the result appears to be normalized to sum to 1 (see this comment); a quick way to compare the types is sketched below
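Continuing the sketch above (model as defined there), a quick way to inspect how the importance types differ:

# get_score accepts an importance_type argument, so the types can be compared.
booster = model.get_booster()
for imp_type in ("weight", "gain", "cover"):
    print(imp_type, booster.get_score(importance_type=imp_type))

# The sklearn-style attribute, normalized to sum to 1.
print("feature_importances_:", model.feature_importances_)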

For more info on this topic, look at How to get feature importance.

Upvotes: 12

Binyamin Even

Reputation: 3382

You can get the feature names with:

model.get_booster().feature_names
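As a caveat (following the answer above by Nerxis): this only returns meaningful names if the model was fitted with them, e.g. on a pandas DataFrame; after fitting on a bare NumPy array you get generic names back.

# After fitting on a DataFrame, the booster keeps the column names.
names = model.get_booster().feature_names
print(names)  # e.g. ['age', 'income', ...], or [f0, f1, ...] after a NumPy fit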

Upvotes: 44
