How to get the highest accuracy with low number of selected features using xgboost?

Question

I have been looking for several feature selection methods and found about the feature selection with help of XGBoost from the following link (XGBoost feature importance and selection). I implemented the method for my case, and results were the following:

Thresh= 0.000, n= 11, Accuracy: 55.56%
Thresh= 0.000, n= 11, Accuracy: 55.56%
Thresh= 0.000, n= 11, Accuracy: 55.56%
Thresh= 0.000, n= 11, Accuracy: 55.56%
Thresh= 0.097, n= 7, Accuracy: 55.56%
Thresh= 0.105, n= 6, Accuracy: 55.56%
Thresh= 0.110, n= 5, Accuracy: 50.00%
Thresh= 0.114, n= 4, Accuracy: 50.00%
Thresh= 0.169, n= 3, Accuracy: 44.44%
Thresh= 0.177, n= 2, Accuracy: 38.89%
Thresh= 0.228, n= 1, Accuracy: 33.33%

So, my question is the following, for this case how can I select the highest accuracy with a low number of features [n]? [The code can be found in the link]

Edit 1:

Thanks to @Mihai Petre, I managed to get it to work with code in his answer. I have another question, say I ran the code from the link and got the following:

Feature Importance results = [29.205832   5.0182242  0.         0.         0. 6.7736177 16.704327  18.75632    9.529003  14.012676   0.       ]
Features = [ 0  7  6  9  8  5  1 10  4  3  2]

Thresh= 0.000, n= 11, Accuracy: 38.89%
Thresh= 0.000, n= 11, Accuracy: 38.89%
Thresh= 0.000, n= 11, Accuracy: 38.89%
Thresh= 0.000, n= 11, Accuracy: 38.89%
Thresh= 0.050, n= 7, Accuracy: 38.89%
Thresh= 0.068, n= 6, Accuracy: 38.89%
Thresh= 0.095, n= 5, Accuracy: 33.33%
Thresh= 0.140, n= 4, Accuracy: 38.89%
Thresh= 0.167, n= 3, Accuracy: 33.33%
Thresh= 0.188, n= 2, Accuracy: 38.89%
Thresh= 0.292, n= 1, Accuracy: 38.89%

How can I remove the features that gave out zero feature importance and get the features with feature importance values?

Side Questions:

I am trying to find the best feature selection that involves using specific classification model and best features that help to give high accuracy, say, for example, using KNN classifier and would like to find the best features that give out high accuracy. What feature selection would be appropriate to use?
When Implementing multiple classification models, is it best to do feature selection for each classification model or you need to do feature selection once and then use the selected features to multiple classification models?

Mihai Petre · Accepted Answer

Ok, so what the guy in your link is doing with

thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

is to create a sorted array of thresholds and then he trains the XGBoost for every element of the thresholds array.

From your question, I'm thinking you want to only select the 6th case, the one with the lowest number of features and highest accuracy. For this case, you'd want to do something like this:

selection = SelectFromModel(model, threshold=threshold[5], prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (threshold[5], select_X_train.shape[1], accuracy*100.0))

If you want to automate the whole thing, then you'd want to calculate the minimum n for which the accuracy is at its maximum inside that for loop, and it would look more or less like this:

n_min = *your maximum number of used features*
acc_max = 0
thresholds = sort(model.feature_importances_)
obj_thresh = thresholds[0]
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    if(select_X_train.shape[1] < n_min) and (accuracy > acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh

selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))

How to get the highest accuracy with low number of selected features using xgboost?

Answers (2)

Related Questions