Reputation: 428
I have been looking into several feature selection methods and found out about feature selection with the help of XGBoost from the following link (XGBoost feature importance and selection). I implemented the method for my case, and the results were the following:
So, my question is the following: for this case, how can I select the highest accuracy with a low number of features [n]? [The code can be found in the link.]
Edit 1:
Thanks to @Mihai Petre, I managed to get it to work with the code in his answer. I have another question: say I ran the code from the link and got the following:
Feature Importance results = [29.205832 5.0182242 0. 0. 0. 6.7736177 16.704327 18.75632 9.529003 14.012676 0. ]
Features = [ 0 7 6 9 8 5 1 10 4 3 2]
How can I remove the features that gave zero feature importance and keep only the features with non-zero importance values?
Upvotes: 0
Views: 1030
Reputation: 428
I managed to sort it out. Please find the code below:
To get the lowest number of features with the highest accuracy:
# model_FS is the XGBoost model already fitted on all features;
# `accuracy` is its test accuracy (a fraction), computed earlier.
f_max = 8                   # upper bound on the number of features to keep
f_min = 2                   # lower bound on the number of features to keep
n_min = f_max               # best (lowest) feature count found so far
acc_max = accuracy * 100    # best accuracy so far, on the same percentage scale as the loop below
thresholds = np.sort(model_FS.feature_importances_)
obj_thresh = thresholds[0]
accuracy_list = []
for thresh in thresholds:
    # select features using threshold:
    selection = SelectFromModel(model_FS, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model:
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model:
    select_X_test = selection.transform(X_test)
    selection_model_pred = selection_model.predict(select_X_test)
    selection_predictions = [round(value) for value in selection_model_pred]
    accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
    accuracy = accuracy * 100
    print('Thresh= %.3f, n= %d, Accuracy: %.2f%%' % (thresh, select_X_train.shape[1], accuracy))
    accuracy_list.append(accuracy)
    # keep the threshold that stays within [f_min, f_max) features without losing accuracy:
    if (select_X_train.shape[1] < f_max) and (select_X_train.shape[1] >= f_min) and (accuracy >= acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# refit at the selected threshold:
selection = SelectFromModel(model_FS, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model:
selection_model = xgb.XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model:
select_X_test = selection.transform(X_test)
selection_model_pred = selection_model.predict(select_X_test)
selection_predictions = [round(value) for value in selection_model_pred]
accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
print("Selected: Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
# bookkeeping: map the number of selected features to the accuracy obtained with it
key_list = list(range(X_train.shape[1], 0, -1))
accuracy_dict = dict(zip(key_list, accuracy_list))
optimum_num_feat = n_min
print(optimum_num_feat)
# Keep only the columns whose importance clears the selected threshold:
f_indices = np.where(model_FS.feature_importances_ >= obj_thresh)[0]
X_train = X_train.iloc[:, f_indices]
X_test = X_test.iloc[:, f_indices]
print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
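As a side note (not from the original link): if you only need the column indices kept at the chosen threshold, SelectFromModel exposes them directly through its get_support method, which avoids the manual index bookkeeping. A minimal sketch, assuming the model_FS, obj_thresh, X_train and X_test defined above:
selection = SelectFromModel(model_FS, threshold=obj_thresh, prefit=True)
kept_idx = selection.get_support(indices=True)  # positional indices of the kept columns
X_train_sel = X_train.iloc[:, kept_idx]
X_test_sel = X_test.iloc[:, kept_idx]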
To get the features with importance values without zero importance values:
# Calculate feature importances:
importances = model_FS.feature_importances_
print(importances * 100)
# Organise the feature importances in a dictionary keyed by column index:
## key_list runs over all columns; its length equals your total number of features:
key_list = range(len(importances))
feature_importance_dict = dict(zip(key_list, importances))
sort_feature_importance_dict = dict(sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True))
print('Feature Importance Dictionary (Sorted): ', sort_feature_importance_dict)
# Removing the features that have zero feature importance:
filtered_feature_importance_dict = {x: y for x, y in sort_feature_importance_dict.items() if y != 0}
print('Filtered Feature Importance Dictionary: ', filtered_feature_importance_dict)
f_indices = np.asarray(list(filtered_feature_importance_dict.keys()))
print(f_indices)
# iloc (positional) rather than loc (label-based), since f_indices are column positions:
X_train = X_train.iloc[:, f_indices]
X_test = X_test.iloc[:, f_indices]
print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
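For what it's worth, the dictionary step can also be compressed into a couple of numpy calls; a minimal sketch, assuming the same model_FS as above:
importances = model_FS.feature_importances_
nonzero_idx = np.nonzero(importances)[0]  # indices of features with non-zero importance
nonzero_idx = nonzero_idx[np.argsort(importances[nonzero_idx])[::-1]]  # sort by importance, descending
X_train = X_train.iloc[:, nonzero_idx]
X_test = X_test.iloc[:, nonzero_idx]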
Upvotes: 0
Reputation: 36
Ok, so what the guy in your link is doing with
thresholds = sort(model.feature_importances_)  # numpy's sort (from numpy import sort)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
is creating a sorted array of thresholds and then training an XGBoost model for every element of the thresholds array.
From your question, I think you want to select only the 6th case, the one with the lowest number of features and the highest accuracy. For that case, you'd want to do something like this:
selection = SelectFromModel(model, threshold=thresholds[5], prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresholds[5], select_X_train.shape[1], accuracy*100.0))
If you want to automate the whole thing, then you'd want to calculate the minimum n for which the accuracy is at its maximum inside that for loop, and it would look more or less like this:
n_min = X_train.shape[1]  # start at your maximum number of used features
acc_max = 0
thresholds = sort(model.feature_importances_)
obj_thresh = thresholds[0]
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    # keep the threshold that reduces the feature count while improving accuracy:
    if (select_X_train.shape[1] < n_min) and (accuracy > acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# retrain and evaluate at the chosen threshold:
selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
Upvotes: 1