noob

Reputation: 3811

KeyError: 'weight'. Implement XGBoost only on features selected by feature importance

Using XGBoost feature importance, I get an importance score for each of the 49 features in my dataframe X_train. Now I want to decide how many features to use in my machine learning model. The candidate thresholds are the sorted importance scores themselves (the thresholds array below), and I want to find the minimum importance score a feature must have to be included: should I keep all features scoring above 0.3, above 0.4, and so on? However, I am getting an error:

import xgboost as xgb
from numpy import sort
from sklearn.feature_selection import SelectFromModel

# Fit the full model on all 49 features
xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.08, n_jobs=-1).fit(X_train, y_train)

# Sorted importance scores, used as candidate thresholds
thresholds = sort(xgb_model.feature_importances_)

The thresholds for all the features are as below:

[IN]thresholds
[OUT] array([0.        , 0.        , 0.        , 0.        , 0.        ,
   0.        , 0.        , 0.        , 0.        , 0.        ,
   0.        , 0.        , 0.        , 0.        , 0.        ,
   0.00201289, 0.00362736, 0.0036676 , 0.00467797, 0.00532952,
   0.00591741, 0.00630169, 0.00661084, 0.00737418, 0.00741502,
   0.00748773, 0.00753344, 0.00773079, 0.00852909, 0.00859741,
   0.00906814, 0.00929257, 0.00980796, 0.00986394, 0.01056027,
   0.01154695, 0.01190695, 0.01203871, 0.01258377, 0.01301482,
   0.01383268, 0.01390096, 0.02001457, 0.02699436, 0.03168892,
   0.03543754, 0.03578222, 0.13946259, 0.48038903], dtype=float32)

Then, for each threshold, I select only the features at or above that threshold into select_X_train and retrain:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(xgb_model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train a fresh model on the selected features
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval the model on the same selection of test features
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))

I am getting below error:

----> 4     select_X_train = selection.transform(X_train)
KeyError: 'weight'

There is no column named weight in my data. How do I resolve this error?

Expected output

Thresh=0.00201289, n=33, Accuracy: 77.95% 
#33 features with threshold above 0.002

Thresh=0.00362736, n=34, Accuracy: 76.38%
#34 features with threshold above 0.003

Thresh=0.0036676 , n=35, Accuracy: 77.56%
#35 features with threshold above 0.003 and so on

So basically, for each threshold, run XGBoost using only the features whose importance score is at least that threshold, and calculate the accuracy. For example, in the first case all features scoring at least 0.00201289 are used and accuracy is calculated; next, all features scoring at least 0.00362736 are used, and so on.
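As a side note, a version-independent workaround (my own suggestion, not something stated in this question or its answer) is to replace SelectFromModel with a plain NumPy boolean mask over the columns; since no scikit-learn selector is involved, the KeyError cannot arise. This sketch uses a few of the importance values printed above and a stand-in array for X_train to show just the masking logic:

```python
import numpy as np

# A few importance scores copied from the thresholds array printed above.
importances = np.array([0.0, 0.00201289, 0.00362736, 0.13946259, 0.48038903])

def select_columns(X, importances, thresh):
    """Keep columns whose importance is >= thresh
    (the same rule SelectFromModel applies)."""
    mask = importances >= thresh
    return X[:, mask]

X = np.arange(20, dtype=float).reshape(4, 5)  # stand-in for X_train

for thresh in np.sort(importances):
    X_sel = select_columns(X, importances, thresh)
    print("Thresh=%.8f, n=%d" % (thresh, X_sel.shape[1]))
# n shrinks from 5 down to 1 as the threshold rises to the largest score.
```

Inside the original loop, `select_X_train` and `select_X_test` would both be produced by `select_columns` with the same mask, and the rest of the training/evaluation code stays unchanged.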

Upvotes: 1

Views: 839

Answers (1)

Gustavo Bertoli

Reputation: 48

I was following a similar tutorial and successfully performed this threshold-based feature selection after downgrading to xgboost==0.90.

Also, to avoid nuisance warnings, use XGBClassifier(objective='reg:squarederror').

Upvotes: 0
