Reputation: 51
Using XGBoost's xgb.importance(), an importance matrix can be printed showing each variable's importance to the classification as measured by Gain, Cover, and Frequency. Gain is the recommended indicator of variable importance.
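For context, a minimal standalone sketch of xgb.importance() (the toy data and model below are illustrative assumptions, not part of the original setup):

library(xgboost)
set.seed(1)
# toy binary outcome driven by the first feature (hypothetical data)
X <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, c("f1", "f2", "f3", "f4")))
y <- as.numeric(X[, 1] + rnorm(200) > 0)
bst <- xgboost(data = X, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)
xgb.importance(model = bst)  # one row per feature: Gain, Cover, Frequency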
Using caret resampling (repeatedcv, number = 10, repeats = 5), a particular tuning grid, and train method = "xgbTree", the caret varImp() function shows the k-fold feature importance estimation scaled from 0-100%.
My question is: does the caret varImp(xgbMod) wrapper function use Gain alone, or some combination of Gain, Cover, and Frequency?
Upvotes: 5
Views: 2552
Reputation: 46978
One small clarification:
the caret varImp() function shows the k-fold feature importance estimation scaled from 0-100%.
caret estimates feature importance from the final fitted model, not from the cross validations; the cross validations only determine the best hyperparameters (e.g. gamma) with which to fit that final model.
It is Gain. There is not much documentation on this, but I checked using an example:
library(caret)
data = MASS::Pima.tr  # Pima Indians diabetes data; binary outcome "type"
set.seed(111)
mdl = train(type ~ ., data = data, method = "xgbTree", tuneLength = 3,
            trControl = trainControl(method = "cv"))
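As an aside, you can see on the train object itself that resampling only picks the hyperparameters, while a single model is then refit on the full data:

mdl$bestTune           # hyperparameters selected by cross validation
class(mdl$finalModel)  # "xgb.Booster" -- the refit model that varImp() inspects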
Set scale = FALSE to get the raw values:
varImp(mdl, scale = FALSE)
xgbTree variable importance

      Overall
glu   0.37953
age   0.19184
ped   0.16418
bmi   0.13755
npreg 0.06450
skin  0.04526
bp    0.01713
Compare with xgb.importance:
xgboost::xgb.importance(mdl$finalModel$feature_names, model = mdl$finalModel)
   Feature       Gain      Cover Frequency
1:     glu 0.37953480 0.17966683      0.16
2:     age 0.19183994 0.17190387      0.17
3:     ped 0.16417775 0.26768973      0.28
4:     bmi 0.13755463 0.09755036      0.09
5:   npreg 0.06450183 0.10811269      0.11
6:    skin 0.04526090 0.11229235      0.12
7:      bp 0.01713014 0.06278416      0.07
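To confirm the match programmatically rather than by eye, a quick sketch using the same mdl object as above:

vi <- varImp(mdl, scale = FALSE)$importance  # caret's raw values, rownames = features
gi <- xgboost::xgb.importance(mdl$finalModel$feature_names, model = mdl$finalModel)
all.equal(vi[gi$Feature, "Overall"], gi$Gain)  # TRUE: varImp() reports Gain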
Upvotes: 5