WDpad159

Reputation: 428

How to get the highest accuracy with low number of selected features using xgboost?

I have been looking into several feature selection methods and found one that uses XGBoost, described in the following link (XGBoost feature importance and selection). I implemented the method for my case, and the results were the following:

So, my question is the following: for this case, how can I get the highest accuracy with a low number of selected features [n]? [The code can be found in the link.]

Edit 1:

Thanks to @Mihai Petre, I managed to get it to work with the code in his answer. I have another question: say I ran the code from the link and got the following:

Feature Importance results = [29.205832   5.0182242  0.         0.         0. 6.7736177 16.704327  18.75632    9.529003  14.012676   0.       ]
Features = [ 0  7  6  9  8  5  1 10  4  3  2]

How can I remove the features that have zero importance and keep only the features with non-zero importance values?

Side Questions:

  1. I am trying to find the best feature selection method for a specific classification model, i.e. the features that give the highest accuracy with that model. For example, using a KNN classifier, I would like to find the features that give the highest accuracy. What feature selection method would be appropriate here?
  2. When implementing multiple classification models, is it best to do feature selection separately for each model, or to do feature selection once and then use the selected features with all of the models?

Upvotes: 0

Views: 1030

Answers (2)

WDpad159
WDpad159

Reputation: 428

I managed to sort it out. Please find the code below:

To get the lowest number of features with the highest accuracy:

# Imports used below (from the linked example):
import numpy as np
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# Search over thresholds (model_FS is the already-fitted XGBoost model):
f_max = 8      # accept feature counts below this bound
f_min = 2      # accept feature counts at or above this bound
n_min = f_max  # best (lowest) feature count found so far
acc_max = 0    # best accuracy found so far
thresholds = np.sort(model_FS.feature_importances_)
obj_thresh = thresholds[0]
accuracy_list = []
for thresh in thresholds:
    # select features using threshold:
    selection = SelectFromModel(model_FS, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model:
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model:
    select_X_test = selection.transform(X_test)
    selection_model_pred = selection_model.predict(select_X_test)
    selection_predictions = [round(value) for value in selection_model_pred]
    accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
    accuracy = accuracy * 100
    print('Thresh= %.3f, n= %d, Accuracy: %.2f%%' % (thresh, select_X_train.shape[1], accuracy))
    accuracy_list.append(accuracy)
    # keep the threshold with a feature count in [f_min, f_max) and the best accuracy so far:
    if (select_X_train.shape[1] < f_max) and (select_X_train.shape[1] >= f_min) and (accuracy >= acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# refit using the winning threshold:
selection = SelectFromModel(model_FS, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model:
selection_model = xgb.XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model:
select_X_test = selection.transform(X_test)
selection_model_pred = selection_model.predict(select_X_test)
selection_predictions = [round(value) for value in selection_model_pred]
accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
print("Selected: Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
# map feature count -> accuracy (the count shrinks as the threshold grows):
key_list = list(range(X_train.shape[1], 0, -1))
accuracy_dict = dict(zip(key_list, accuracy_list))
optimum_num_feat = n_min
print(optimum_num_feat)

# Printing out the features kept by the winning threshold:
selected_idx = selection.get_support(indices=True)
X_train = X_train.iloc[:, selected_idx]
X_test = X_test.iloc[:, selected_idx]

print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
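
If you just want to read the winning pair back out of accuracy_dict, a minimal sketch (assuming accuracy_dict is keyed by feature count, as built above):

# Highest accuracy wins; ties go to the smaller feature count:
best_n, best_acc = max(accuracy_dict.items(), key=lambda kv: (kv[1], -kv[0]))
print('Best: n=%d, Accuracy: %.2f%%' % (best_n, best_acc))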

To get the features with non-zero importance values:

# Calculate feature importances
importances = model_FS.feature_importances_
print((model_FS.feature_importances_) * 100)

# Organising the feature importances in a dictionary keyed by column position:
## One key per feature column, instead of a hardcoded count:
key_list = range(len(importances))
feature_importance_dict = dict(zip(key_list, importances))
sort_feature_importance_dict = dict(sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True))
print('Feature Importance Dictionary (Sorted): ', sort_feature_importance_dict)

# Removing the features that have zero feature importance:
filtered_feature_importance_dict = {x: y for x, y in sort_feature_importance_dict.items() if y != 0}
print('Filtered Feature Importance Dictionary: ', filtered_feature_importance_dict)
f_indices = np.asarray(list(filtered_feature_importance_dict.keys()))
print(f_indices)

# f_indices holds column positions, so use iloc (loc would look up column labels):
X_train = X_train.iloc[:, f_indices]
X_test = X_test.iloc[:, f_indices]

print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
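
The same zero-importance filter can also be written in one step with NumPy; a minimal sketch, assuming the fitted model_FS from above:

# Column positions of all features with non-zero importance:
f_indices = np.flatnonzero(model_FS.feature_importances_)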

Upvotes: 0

Mihai Petre

Reputation: 36

Ok, so what the guy in your link is doing with

from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

is to create a sorted array of thresholds and then train an XGBoost model for each element of that array.
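
For context, SelectFromModel(model, threshold=thresh, prefit=True) keeps every feature whose importance is greater than or equal to thresh, so looping over the sorted importances walks from keeping all features down to keeping only the single most important one. A rough sketch of the equivalent masking, assuming X_train is a NumPy array and model is already fitted:

# What SelectFromModel does for a single threshold, expressed as a boolean mask:
kept_mask = model.feature_importances_ >= thresh
select_X_train = X_train[:, kept_mask]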

From your question, I think you want to select only the 6th case, the one with the lowest number of features and the highest accuracy. For this case, you'd want to do something like this:

selection = SelectFromModel(model, threshold=thresholds[5], prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (threshold[5], select_X_train.shape[1], accuracy*100.0))

If you want to automate the whole thing, then you'd want to calculate the minimum n for which the accuracy is at its maximum inside that for loop, and it would look more or less like this:

n_min = X_train.shape[1] + 1  # upper bound; or set this to your maximum number of features to use
acc_max = 0
thresholds = sort(model.feature_importances_)
obj_thresh = thresholds[0]
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    if(select_X_train.shape[1] < n_min) and (accuracy > acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh

selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))

Upvotes: 1
