hanna

Reputation: 86

caret::rfe: get the best-performing variables for a given subset size

I ran an rfe model with around 400 variables, and the result was that the optimal model uses 40 variables. However, plotting the standard deviations of the cross-validated error shows that the 40-variable model performs only slightly better than a model with only 10 variables. That's why I'd like to go with the smaller model: I would like to take the 10 variables that perform best for a ten-variable model and train the model again.

How can I get the 10 variables which lead to the model performance shown in the rfe object?

Since I use rerank=TRUE, I cannot just pick the 10 highest-ranked variables from varImp(rfeModel$fit), right? (Would this work if I were not using "rerank"?)

I'm also struggling with the differences between the output from varImp(rfeModel$fit), varImp(rfeModel), and pickVars(rfeModel$variables, 40). What is the right way to get the best-performing variables for the size of interest?

The following example can be used:

library(caret)

data(BloodBrain)

x <- scale(bbbDescr[, -nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)

set.seed(1)

rfProfile <- rfe(x, logBBB,
                 sizes = c(2, 5, 10, 20),
                 method = "nnet",
                 maxit = 10,
                 rfeControl = rfeControl(functions = caretFuncs,
                                         returnResamp = "all",
                                         method = "cv",
                                         rerank = TRUE))

print(rfProfile)
varImp(rfProfile$fit)
varImp(rfProfile)
pickVars(rfProfile$variables, rfProfile$optsize)

Upvotes: 3

Views: 2651

Answers (1)

topepo

Reputation: 14316

The simplest thing to do is to use the update function:

new_profile <- update(rfProfile, x = x, y = logBBB, size = 10) 

Internally, it uses this code:

selectedVars <- rfProfile$variables
bestVar <- rfProfile$control$functions$selectVar(selectedVars, 10)
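
To then retrain on just those ten predictors, one option is to pass the selected columns to train(). The following is only a sketch under stated assumptions: it repeats the setup from the question so that it runs on its own, and the names top10 and fit10, the trace = FALSE argument, and the 10-fold trainControl() are illustrative choices that do not come from the original post.

```r
library(caret)

data(BloodBrain)  # provides bbbDescr and logBBB

# Same preprocessing as in the question
x <- scale(bbbDescr[, -nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), 0.8)]
x <- as.data.frame(x)

set.seed(1)
rfProfile <- rfe(x, logBBB,
                 sizes = c(2, 5, 10, 20),
                 method = "nnet",
                 maxit = 10,
                 trace = FALSE,  # assumption: only silences nnet's iteration log
                 rfeControl = rfeControl(functions = caretFuncs,
                                         returnResamp = "all",
                                         method = "cv",
                                         rerank = TRUE))

# Names of the variables chosen for a 10-predictor model, obtained with the
# selectVar function stored in the control object (the call shown above)
top10 <- rfProfile$control$functions$selectVar(rfProfile$variables, 10)

# Refit on those columns only; the resampling setup here is an assumption
set.seed(1)
fit10 <- train(x[, top10, drop = FALSE], logBBB,
               method = "nnet",
               maxit = 10,
               trace = FALSE,
               trControl = trainControl(method = "cv"))
```

Printing top10 lists the selected predictors. Calling train() directly, rather than update(), makes the resampling and tuning options for the final model explicit, at the cost of repeating the model arguments.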

Max

Upvotes: 4
