Or Pickholz

Reputation: 23

Difference between running XGBoost on top X most important features and using the transform method

Recently I've been working on an XGBoost model and using it for feature selection based on the feature importance scores (https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/).

This technique builds models iteratively based on the most important features:

  1. First building a model based on all features and giving each feature an importance score.
  2. Then building models iteratively: first a model based on the most important feature, then on the 2 most important features, then on the 3 most important, and so on.

The code for building the models iteratively (taken from the link attached at the top):

# imports needed to run this snippet
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
# (the original snippet uses MyXGBClassifier, a small wrapper from the linked article;
#  a plain XGBClassifier behaves the same way for this example)
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# fit a model using each importance score as a threshold, from most to least important
# (numpy's sort has no reverse argument, so the builtin sorted is used here)
thresholds = sorted(model.feature_importances_, reverse=True)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

My question is why I need this line in order to select the model's features from the test set:

select_X_test = selection.transform(X_test)

Why can't I just select the top len(select_X_train) most important features from model.feature_importances_ and use this subset of the test set for prediction? When I tried doing so I got a low-performing model that labeled almost every instance as true, but when I used selection.transform(X_test) I got much better performance (~70% precision and recall).

Thanks in advance!

Upvotes: 2

Views: 921

Answers (2)

m13op22

Reputation: 2337

Why can't I just select the top len(select_X_train) most important features from model.feature_importances_ and use this subset of the test set for prediction?

Because choosing the top "len(select_X_train)" features like X_train[:, :len(select_X_train)] (which really should be X_train[:, :select_X_train.shape[1]], since len returns the number of rows) does not give you the most important features! That's why your model's performance was worse.

Why not? Because the features in X_train aren't sorted by importance (from highest to lowest); they're in the order you loaded them in.
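
For example (a minimal sketch, assuming the model fitted above), the importance scores line up with the columns as loaded, and you would have to sort them yourself to find which columns matter most:

import numpy as np

print(model.feature_importances_)                      # one score per column, in load order
print(np.argsort(model.feature_importances_)[::-1])    # column indices, most important first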

How do you get the most important features from the SelectFromModel object? Well, you can use the get_support() method. Look at the case where thresh <= 0.13577053 or n=3.

print(selection.get_support())

# Boolean mask over the columns of X_train: True means the column is kept
[False  True False False False  True False  True]

If you take the first n = select_X_train.shape[1] features, you take the first n columns of your dataset. For X_test that would be:

[[102.   30.8  26. ]
 [ 77.   33.3  24. ]
 [124.   35.4  34. ]
 [111.   30.1  23. ]
 [108.   30.8  21. ]]

To get the correct features, you'd have to do

print(X_test[:5, selection.get_support()]) # prints first 5 rows

[[ 90.   27.2  24. ]
 [181.   35.9  51. ]
 [152.   26.8  43. ]
 [ 93.   28.7  23. ]
 [125.   27.6  49. ]]

But that's exactly what the transform method does for you:

select_X_test = selection.transform(X_test)

print("Transformed select_X_test")
print(select_X_test[:5,:])

Transformed select_X_test: 
[[ 90.   27.2  24. ]
 [181.   35.9  51. ]
 [152.   26.8  43. ]
 [ 93.   28.7  23. ]
 [125.   27.6  49. ]]

The selection object remembers which columns were chosen (based on the fitted model's importances and the threshold) and selects those same columns when you apply it to X_test. Hence the line

select_X_test = selection.transform(X_test)
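
As a quick sanity check (a minimal sketch, assuming the selection object and X_test from the loop above), transform and manual masking with get_support() pick exactly the same columns:

import numpy as np

# both select the same columns, in their original order
assert np.array_equal(selection.transform(X_test),
                      X_test[:, selection.get_support()])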

Upvotes: 1

Ghassen Sultana

Reputation: 1402

I wanted to recreate the same behavior, so I built the pipeline as you mentioned.

The pipeline:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

# load the data and train a first model on all features
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# importance scores sorted from highest to lowest
thresholds = sorted(model.feature_importances_, reverse=True)

After training the first model, I used two methods to extract the features. The first, using SelectFromModel, is below:

for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier(random_state=7)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

This gave me the following accuracies: 67.32% => 71.26% => 71.26% => 74.80% => 74.41% => 71.26% => 71.65% => 74.02%.

After that, I tried manually creating the train and test sets without SelectFromModel:

for thresh in range(8):
    # boolean mask over the columns, built from the argsort of the importances
    mask_of_the_column_to_use = (model.feature_importances_.argsort() <= thresh)
    data_with_choosen_columns = X_train[:, mask_of_the_column_to_use].copy()
    selection_model = XGBClassifier(random_state=7)
    selection_model.fit(data_with_choosen_columns, y_train)
    # eval model
    select_X_test = X_test[:, mask_of_the_column_to_use].copy()
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%d, n=%d, Accuracy: %.2f%%" % (thresh, data_with_choosen_columns.shape[1], accuracy*100.0))

This gave me the following accuracies: 67.32% => 64.57% => 69.69% => 68.90% => 69.69% => 70.47% => 70.08% => 74.02%.

So there is a small variation in the accuracy; furthermore, as we add columns, the two approaches converge to the same result.
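
For reference, a minimal sketch (reusing only the objects defined above) of a manual selection based on the importance ranks, which keeps the n most important columns in their original order, the same columns SelectFromModel picks when there are no ties at the threshold:

n_features = X_train.shape[1]
for n_keep in range(1, n_features + 1):
    # rank of each column by importance: 0 = least important
    ranks = model.feature_importances_.argsort().argsort()
    # keep the n_keep most important columns, preserving their original order
    mask = ranks >= n_features - n_keep
    manual_model = XGBClassifier(random_state=7)
    manual_model.fit(X_train[:, mask], y_train)
    predictions = manual_model.predict(X_test[:, mask])
    accuracy = accuracy_score(y_test, predictions)
    print("n=%d, Accuracy: %.2f%%" % (n_keep, accuracy * 100.0))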

Upvotes: 1
