caret:rfe get best performing variables for a certain size

Question

I ran an rfe model with around 400 variables, and the result was that the optimal model uses 40 variables. However, plotting the standard deviations of the cross-validation error shows that the 40-variable model performs only slightly better than a model with only 10 variables. That is why I would like to go with the 10-variable model: I want to take the 10 variables that perform best in a ten-variable model and train the model again.

How can I get the 10 variables which lead to the model performance shown in the rfe object?

Since I use rerank=TRUE, I cannot just pick the 10 highest-ranked variables from varImp(rfeModel$fit), right? (Would this work if I were not using rerank?)

I'm also struggling with the differences between the output of varImp(rfeModel$fit), varImp(rfeModel) and pickVars(rfeModel$variables, 40). What is the right way to get the best-performing variables for a given subset size?

The following example can be used:

library(caret)

data(BloodBrain)

x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)

set.seed(1)


rfProfile <- rfe(x, logBBB,
                 sizes = c(2, 5, 10, 20),
                 method = "nnet",
                 maxit = 10,
                 # name the argument so it is not matched positionally to `metric`
                 rfeControl = rfeControl(functions = caretFuncs,
                                         returnResamp = "all",
                                         method = "cv",
                                         rerank = TRUE))

print(rfProfile)
varImp(rfProfile$fit)
varImp(rfProfile)
pickVars(rfProfile$variables, rfProfile$optsize)
