Reputation: 23
Recently I've been working on an XGBoost model and using it for feature selection based on the feature importance scores (https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/).
This technique builds models iteratively based on the most important features:
The code for building the models iteratively (taken from the link attached at the top):
# imports needed for the snippet below
from numpy import loadtxt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold, from highest to lowest
thresholds = sorted(model.feature_importances_, reverse=True)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
My question is: why do I need this line in order to select the model features from the test set:
select_X_test = selection.transform(X_test)
Why can't I just select the top len(select_X_train) most important features from model.feature_importances_ and use that subset of the test set for prediction? When I tried doing so, I got a low-performing model that labeled almost every instance as true, but when I used selection.transform(X_test) I got a model with much better performance (~70% precision and recall).
Thanks in advance!
Upvotes: 2
Views: 921
Reputation: 2337
Why can't I just select the top len(select_X_train) most important features from model.feature_importances_ and use this subset of the test set for prediction?
Because choosing the top "len(select_X_train)" features like X_train[:, :len(select_X_train)] (which really should be X_train[:, :select_X_train.shape[1]], since len returns the number of rows) does not give you the top most important features! That's why your model performance was worse.
Why not? Because the features in X_train aren't sorted by importance (from highest to lowest); they're in the order you loaded them in.
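You can check that quickly by printing the importances next to their column positions (a small sketch, assuming the fitted model from above):
# feature_importances_[i] belongs to column i of X_train,
# i.e. the importances follow the original column order, not a sorted order
for i, imp in enumerate(model.feature_importances_):
    print("column %d: importance %.4f" % (i, imp))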
How do you get the most important features from the SelectFromModel object? Well, you can use the get_support() method. Look at the case where thresh <= 0.13577053, or n=3.
print(selection.get_support())
# Boolean mask over range(X_train.shape[1]): True marks the selected features
[False True False False False True False True]
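If you prefer the column indices over a boolean mask, one option (a sketch, assuming numpy is imported as np) is:
# positions of the selected columns in the original X_train/X_test
selected_idx = np.where(selection.get_support())[0]
print(selected_idx)  # [1 5 7] for the mask above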
If you take the first "n=select_X_train.shape[1]" features, you take the first n features in your dataset. For X_test that would be
[[102. 30.8 26. ]
[ 77. 33.3 24. ]
[124. 35.4 34. ]
[111. 30.1 23. ]
[108. 30.8 21. ]]
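Those values come from naive positional slicing, i.e. something like this (a sketch; n is 3 at this threshold):
# first 3 columns of the first 5 rows -- NOT the 3 most important features
print(X_test[:5, :3])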
To get the correct features, you'd have to do
print(X_test[:5, selection.get_support()]) # prints first 5 rows
[[ 90. 27.2 24. ]
[181. 35.9 51. ]
[152. 26.8 43. ]
[ 93. 28.7 23. ]
[125. 27.6 49. ]]
But that's what the transform function does for you:
select_X_test = selection.transform(X_test)
print("Transformed select_X_test")
print(select_X_test[:5,:])
Transformed select_X_test:
[[ 90. 27.2 24. ]
[181. 35.9 51. ]
[152. 26.8 43. ]
[ 93. 28.7 23. ]
[125. 27.6 49. ]]
selection saves the features chosen when SelectFromModel is run on X_train and then selects those features when you apply it to X_test. Hence the line
select_X_test = selection.transform(X_test)
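You can verify that equivalence with a quick check (a sketch, assuming numpy is imported as np):
# transform() applies the mask learned on X_train to any new data
print(np.array_equal(selection.transform(X_test),
                     X_test[:, selection.get_support()]))  # True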
Upvotes: 1
Reputation: 1402
I wanted to recreate the same behavior, so I built the pipeline as you mentioned.
The pipeline:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
thresholds = sorted(model.feature_importances_, reverse=True)
After training the first model, I used the two methods to extract the features. The first method is below:
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier(random_state=7)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
This gave me the following accuracies: 67.32% => 71.26% => 71.26% => 74.80% => 74.41% => 71.26% => 71.65% => 74.02%.
After that, I tried manually creating the train and test sets without SelectFromModel:
for thresh in range(8):
    # boolean mask of the columns to use (based on the argsort of the importances)
    mask_of_the_column_to_use = (model.feature_importances_.argsort() <= thresh)
    data_with_choosen_columns = X_train[:, mask_of_the_column_to_use].copy()
    selection_model = XGBClassifier(random_state=7)
    selection_model.fit(data_with_choosen_columns, y_train)
    # eval model
    select_X_test = X_test[:, mask_of_the_column_to_use].copy()
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, data_with_choosen_columns.shape[1], accuracy*100.0))
This gave me the following accuracies: 67.32% => 64.57% => 69.69% => 68.90% => 69.69% => 70.47% => 70.08% => 74.02%.
So there is a small variation in the accuracy. Furthermore, as we add columns, we converge to the same results.
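If you want a manual selection that keeps the same columns as SelectFromModel, one way (a sketch; k is the number of features to keep, and barring ties at the threshold) is to rank the importances explicitly:
k = 3
# indices of the k most important features, put back in original column order
top_k_idx = np.sort(np.argsort(model.feature_importances_)[::-1][:k])
manual_X_train = X_train[:, top_k_idx]
manual_X_test = X_test[:, top_k_idx]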
Upvotes: 1