Reputation: 23
Recently I've been working on an XGBoost model and using it for feature selection based on the feature importance scores (https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/).
This technique builds models iteratively based on the most important features:
The code for building the models iteratively (taken from the link attached at the top):
# imports needed for the snippet below
from numpy import loadtxt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold, from highest to lowest
thresholds = sorted(model.feature_importances_, reverse=True)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
My question is: why do I need this line in order to select the model features from the test set:
select_X_test = selection.transform(X_test)
Why can't I just select the top len(select_X_train) most important features from model.feature_importances_ and use that subset of the test set for prediction? When I tried doing so, I got a low-performing model that labeled almost every instance as true, but when I used selection.transform(X_test) I got a model with much better performance (~70% precision and recall).
Thanks in advance!
Upvotes: 2
Views: 921
Reputation: 2337
Why can't I just select the top len(select_X_train) most important features from model.feature_importances_ and use this subset of the test set for prediction?
Because choosing the top "len(select_X_train)" features like X_train[:, :len(select_X_train)] (which really should be X_train[:, :select_X_train.shape[1]], since len returns the number of rows) does not give you the top most important features! That's why your model performance was worse.
Why not? Because the features in X_train aren't sorted by importance (from highest to lowest); they're in the order you loaded them in.
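You can check that quickly by printing the importances next to their column positions (a small sketch, assuming the fitted model from above):
# feature_importances_[i] belongs to column i of X_train,
# i.e. the importances follow the original column order, not a sorted order
for i, imp in enumerate(model.feature_importances_):
    print("column %d: importance %.4f" % (i, imp))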
How do you get the most important features from the SelectFromModel object? Well, you can use the get_support() method. Look at the case where thresh <= 0.13577053, or n=3.
print(selection.get_support())
# Boolean mask over range(X_train.shape[1]): True marks the selected features
[False True False False False True False True]
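If you prefer the column indices over a boolean mask, one option (a sketch, assuming numpy is imported as np) is:
# positions of the selected columns in the original X_train/X_test
selected_idx = np.where(selection.get_support())[0]
print(selected_idx)  # [1 5 7] for the mask above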
If you take the first "n=select_X_train.shape[1]" features, you take the first n features in your dataset. For X_test that would be
[[102. 30.8 26. ]
[ 77. 33.3 24. ]
[124. 35.4 34. ]
[111. 30.1 23. ]
[108. 30.8 21. ]]
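Those values come from naive positional slicing, i.e. something like this (a sketch; n is 3 at this threshold):
# first 3 columns of the first 5 rows -- NOT the 3 most important features
print(X_test[:5, :3])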
To get the correct features, you'd have to do
print(X_test[:5, selection.get_support()]) # prints first 5 rows
[[ 90. 27.2 24. ]
[181. 35.9 51. ]
[152. 26.8 43. ]
[ 93. 28.7 23. ]
[125. 27.6 49. ]]
But that's what the transform function does for you:
select_X_test = selection.transform(X_test)
print("Transformed select_X_test")
print(select_X_test[:5,:])
Transformed select_X_test:
[[ 90. 27.2 24. ]
[181. 35.9 51. ]
[152. 26.8 43. ]
[ 93. 28.7 23. ]
[125. 27.6 49. ]]
selection saves the features chosen when SelectFromModel is run on X_train and then selects those features when you apply it to X_test. Hence the line
select_X_test = selection.transform(X_test)
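You can verify that equivalence with a quick check (a sketch, assuming numpy is imported as np):
# transform() applies the mask learned on X_train to any new data
print(np.array_equal(selection.transform(X_test),
                     X_test[:, selection.get_support()]))  # True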
Upvotes: 1
Reputation: 1402
I wanted to recreate the same behavior, so I built the pipeline as you mentioned.
The pipeline:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
thresholds = sorted(model.feature_importances_, reverse=True)
After training the first model, I used the two methods to extract the features. The first method is below:
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier(random_state=7)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
This gave me the following accuracies: 67.32% => 71.26% => 71.26% => 74.80% => 74.41% => 71.26% => 71.65% => 74.02%.
After that, I tried manually creating the train and test sets without SelectFromModel:
for thresh in range(8):
    # boolean mask of the columns to use (based on the argsort of the importances)
    mask_of_the_column_to_use = (model.feature_importances_.argsort() <= thresh)
    data_with_choosen_columns = X_train[:, mask_of_the_column_to_use].copy()
    selection_model = XGBClassifier(random_state=7)
    selection_model.fit(data_with_choosen_columns, y_train)
    # eval model
    select_X_test = X_test[:, mask_of_the_column_to_use].copy()
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, data_with_choosen_columns.shape[1], accuracy*100.0))
This gave me the following accuracies: 67.32% => 64.57% => 69.69% => 68.90% => 69.69% => 70.47% => 70.08% => 74.02%.
So there is a small variation in the accuracy. Furthermore, as we add columns, we converge to the same results.
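If you want a manual selection that keeps the same columns as SelectFromModel, one way (a sketch; k is the number of features to keep, and barring ties at the threshold) is to rank the importances explicitly:
k = 3
# indices of the k most important features, put back in original column order
top_k_idx = np.sort(np.argsort(model.feature_importances_)[::-1][:k])
manual_X_train = X_train[:, top_k_idx]
manual_X_test = X_test[:, top_k_idx]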
Upvotes: 1