Reputation: 3811
Using XGBoost's feature importance I get the importance score of each feature in my dataframe X_train, which initially had 49 features. Now I want to decide how many features to keep in my machine learning model. The candidate thresholds are the importance scores themselves, sorted in the thresholds array below. I want to know the minimum importance score a feature should have to be included: should I keep all features scoring above 0.3, above 0.4, etc.? However, I am getting an error:
import xgboost as xgb
from numpy import sort
from sklearn.feature_selection import SelectFromModel

xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.08, n_jobs=-1).fit(X_train, y_train)
thresholds = sort(xgb_model.feature_importances_)
The thresholds for all the features are as below:
[IN]thresholds
[OUT] array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.00201289, 0.00362736, 0.0036676 , 0.00467797, 0.00532952,
0.00591741, 0.00630169, 0.00661084, 0.00737418, 0.00741502,
0.00748773, 0.00753344, 0.00773079, 0.00852909, 0.00859741,
0.00906814, 0.00929257, 0.00980796, 0.00986394, 0.01056027,
0.01154695, 0.01190695, 0.01203871, 0.01258377, 0.01301482,
0.01383268, 0.01390096, 0.02001457, 0.02699436, 0.03168892,
0.03543754, 0.03578222, 0.13946259, 0.48038903], dtype=float32)
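(As a side note, the sorted scores can be mapped back to column names so you can see which feature each threshold belongs to. A minimal sketch, assuming X_train is a pandas DataFrame; the importances and feature names below are simulated stand-ins for xgb_model.feature_importances_ and X_train.columns:)

```python
import numpy as np

# Simulated stand-ins (assumption: in the real code these come from the
# fitted model and the dataframe).
importances = np.array([0.1, 0.0, 0.6, 0.3], dtype=np.float32)
feature_names = ["age", "height", "income", "score"]

# Sort features from least to most important, mirroring sort(...) above.
order = np.argsort(importances)
ranked = [(feature_names[i], float(importances[i])) for i in order]
print(ranked)  # names in ascending order of importance
```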
Loop to select only the most important features at each threshold and create a dataframe select_X_train containing them:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

for thresh in thresholds:
    # select features whose importance is >= the current threshold
    selection = SelectFromModel(xgb_model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train a fresh model on the reduced feature set
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # evaluate on the test set reduced to the same features
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
I am getting the error below:
----> 4 select_X_train = selection.transform(X_train)
KeyError: 'weight'
There is no column named weight in my data. How do I resolve this error?
Expected output
Thresh=0.00201289, n=33, Accuracy: 77.95%
#33 features with threshold above 0.002
Thresh=0.00362736, n=34, Accuracy: 76.38%
#34 features with threshold above 0.003
Thresh=0.0036676 , n=35, Accuracy: 77.56%
#35 features with threshold above 0.003 and so on
So basically, take each threshold, run XGBoost on all features whose importance score is at least that threshold, and calculate accuracy. For example, in the first case all features scoring at least 0.00201289 are used and accuracy is calculated; next, all features scoring at least 0.00362736 are used, and so on.
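(A side note on the selection step itself: the same per-threshold filtering can be done with a plain NumPy boolean mask instead of SelectFromModel.transform, which sidesteps the failing call entirely. A minimal sketch of the masking logic, with simulated importances and data; assumption: in the real code the mask would index X_train and X_test column-wise, e.g. X_train.loc[:, mask] for a pandas DataFrame:)

```python
import numpy as np

# Simulated stand-ins for xgb_model.feature_importances_, X_train, X_test
# (assumptions for illustration only).
importances = np.array([0.0, 0.02, 0.5, 0.01], dtype=np.float32)
X_train = np.arange(12).reshape(3, 4)  # 3 rows, 4 features
X_test = np.arange(8).reshape(2, 4)

for thresh in np.sort(importances):
    mask = importances >= thresh       # keep features at or above the threshold
    select_X_train = X_train[:, mask]  # train/test reduced to the same columns
    select_X_test = X_test[:, mask]
    print("Thresh=%.5f, n=%d" % (thresh, select_X_train.shape[1]))
```

Each pass keeps one fewer feature (4, 3, 2, 1 here), matching the n= counts in the expected output.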
Upvotes: 1
Views: 839
Reputation: 48
I was following a similar tutorial and successfully performed this threshold-based feature selection after downgrading to xgboost==0.90.
Also, to avoid nuisance warnings, use XGBClassifier(objective='reg:squarederror').
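(If you go this route, the downgrade itself is just the following; assumption: nothing else in your environment pins a newer xgboost:)

```shell
pip install xgboost==0.90
```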
Upvotes: 0