Reputation: 431
I have a question regarding the rfe function from the caret library. On the caret homepage they describe the RFE algorithm with external resampling (algorithm listing omitted here).
For this example I am using the rfe function with 3-fold cross-validation and the train function with a linear SVM and 5-fold cross-validation.
library(kernlab)
library(caret)
data(iris)

# parameters for train(), used for fitting the SVM
trControl <- trainControl(method = "cv", number = 5)

# parameters for the RFE wrapper: 3-fold external CV
rfeControl <- rfeControl(functions = caretFuncs, method = "cv",
                         number = 3, verbose = FALSE)

rf1 <- rfe(as.matrix(iris[, 1:4]), as.factor(iris[, 5]), sizes = c(2, 3),
           rfeControl = rfeControl, trControl = trControl,
           method = "svmLinear")
My understanding is that rfe would split the data (150 samples) into 3 folds, and that the train function would then be run on the training set of each fold (~100 samples) with 5-fold cross-validation to tune the model parameters, followed by the RFE steps. What confuses me is what I see when I look at the results of the rfe function:
> lapply(rf1$control$index, length)
$Fold1
[1] 100
$Fold2
[1] 101
$Fold3
[1] 99
> lapply(rf1$fit$control$index, length)
$Fold1
[1] 120
$Fold2
[1] 120
$Fold3
[1] 120
$Fold4
[1] 120
$Fold5
[1] 120
From that it appears that the training sets of the 5-fold CV contain 120 samples each, whereas I would expect a size of 80.
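To spell out the arithmetic behind my expectation:

# train() inside an outer fold would see ~100 rows, so each inner
# 5-fold training split should hold about
100 * 4/5   # 80
# the reported 120 instead corresponds to the full data:
150 * 4/5   # 120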
So it would be great if someone could clarify how rfe and train work together.
Cheers
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pROC_1.5.4 e1071_1.6-1 class_7.3-5 caret_5.15-048
[5] foreach_1.4.0 cluster_1.14.3 plyr_1.7.1 reshape2_1.2.1
[9] lattice_0.20-10 kernlab_0.9-15
loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.1 grid_2.15.1 iterators_1.0.6
[5] stringr_0.6.1 tools_2.15.1
Upvotes: 9
Views: 3838
Reputation: 11
The problem here is that rf1$fit$control$index does not store what we think it does. To understand this, it was necessary to look into the code. What happens there is the following:
When you call rfe, the whole data set is passed to nominalRfeWorkflow. In nominalRfeWorkflow, the data is split into train and test sets according to rfeControl (in our example 3 times, following the 3-fold CV rule), and each split is passed to rfeIter. These splits can be found in the result under rf1$control$index.
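You can verify that these are the external 3-fold training splits and that they jointly cover all 150 rows:

sapply(rf1$control$index, length)          # ~100 training rows per external fold
length(unique(unlist(rf1$control$index)))  # 150: every row occurs in some split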
In rfeIter, the ~100 training samples (in our example) are used to find the final variables (which is the output of that function). As I understand it, the ~50 test samples (in our example) are used to calculate the performance of the different variable sets, but they are only stored as external performance and are not used to select the final variables. For that selection, the performance estimates of the 5-fold cross-validation are used. But we cannot find those inner indices in the final result returned by rfe.
If we really need them, we would have to fetch them from fitObject$control$index in rfeIter, return them to nominalRfeWorkflow, then to rfe, and from there include them in the resulting rfe-class object returned by rfe.
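The external performance itself, by contrast, is easy to get at; the field names below are from current caret versions, so check str(rf1) if yours differ:

rf1$results    # performance per subset size, averaged over the 3 external folds
rf1$resample   # fold-by-fold external performance (if returnResamp kept it)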
So what is stored in rf1$fit$control$index? When rfe has found the best variables, a final model is fit using those best variables and the full reference data (150 samples). rf1$fit is created in rfe as follows:
fit <- rfeControl$functions$fit(x[, bestVar, drop = FALSE],
                                y,
                                first = FALSE,
                                last = TRUE,
                                ...)
With functions = caretFuncs, this fit function again runs train and performs a final cross-validation on the full reference data with the final feature set, using the trControl passed in via the ellipsis (...).
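You can see this directly, since caretFuncs$fit is essentially a thin wrapper around train; printing it in your installed version should show something like:

caretFuncs$fit
# function (x, y, first, last, ...)
# train(x, y, ...)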
Since our trControl specifies 5-fold CV, it is therefore correct that lapply(rf1$fit$control$index, length) returns 120, because 150/5*4 = 120.
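A quick sanity check along those lines (using the same index fields as above):

sapply(rf1$fit$control$index, length)          # 120 120 120 120 120
length(unique(unlist(rf1$fit$control$index)))  # 150: the final fit used all rows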
Upvotes: 1