Reputation: 21
I'm trying to use the rfe function from the caret package in combination with a PLS-DA model.
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] splines grid parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_4.4 Kendall_2.2 doBy_4.5-13 survival_2.37-7 statmod_1.4.20
[6] preprocessCore_1.26.1 sva_3.10.0 mgcv_1.8-4 nlme_3.1-119 corpcor_1.6.7
[11] car_2.0-22 reshape2_1.4.1 gplots_2.16.0 DMwR_0.4.1 mi_0.09-19
[16] arm_1.7-07 lme4_1.1-7 Matrix_1.1-5 MASS_7.3-37 randomForest_4.6-10
[21] plyr_1.8.1 pls_2.4-3 caret_6.0-41 ggplot2_1.0.0 lattice_0.20-29
[26] pcaMethods_1.54.0 Rcpp_0.11.4 Biobase_2.24.0 BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] abind_1.4-0 bitops_1.0-6 boot_1.3-14 BradleyTerry2_1.0-5 brglm_0.5-9 caTools_1.17.1
[7] class_7.3-11 coda_0.16-1 codetools_0.2-10 colorspace_1.2-4 compiler_3.1.1 digest_0.6.8
[13] e1071_1.6-4 foreach_1.4.2 foreign_0.8-62 gdata_2.13.3 gtable_0.1.2 gtools_3.4.1
[19] iterators_1.0.7 KernSmooth_2.23-13 minqa_1.2.4 munsell_0.4.2 nloptr_1.0.4 nnet_7.3-8
[25] proto_0.3-10 quantmod_0.4-3 R2WinBUGS_2.1-19 ROCR_1.0-5 rpart_4.1-8 scales_0.2.4
[31] stringr_0.6.2 tools_3.1.1 TTR_0.22-0 xts_0.9-7 zoo_1.7-11
To practice I ran the following example using the iris data.
library(caret)  # provides rfe, rfeControl, caretFuncs and trainControl
data(iris)
subsets <- 2:4
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(Species ~., data = iris, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')
All works well.
mod
Recursive feature selection
Outer resampling method: Cross-Validated (5 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
2 0.6533 0.48 0.02981 0.04472
3 0.8067 0.71 0.06412 0.09618 *
4 0.7867 0.68 0.07674 0.11511
The top 3 variables (out of 3):
Sepal.Width, Petal.Length, Sepal.Length
However, when I try to replicate this on data I generated myself, I get the following error, and I can't work out why. Any ideas would be much appreciated.
x <- as.data.frame(matrix(0,10,10))
for(i in 1:9) {x[,i] <- rnorm(10,0,1)}
x[,10] <- as.factor(rbinom(10, 1, 0.5))
subsets <- 2:9
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(V10 ~., data = x, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')
Error in { : task 1 failed - "undefined columns selected"
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
4: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
5: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Upvotes: 1
Views: 766
Reputation: 21
After a lot of trial and error, I worked out that the levels of the response factor have to be character labels (such as 'Positive' and 'Negative') rather than numbers (such as 0 and 1) in order to combine PLS-DA with RFE in caret.
For example...
x <- data.frame(matrix(rnorm(1000),100,10))
y <- as.factor(c(rep('Positive',40), rep('Negative',60)))
data <- data.frame(x,y)
subsets <- 2:9
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(y ~., data, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')
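The same idea can be applied to the data from the question by recoding the 0/1 outcome before calling rfe. This is only a minimal sketch in base R: the seed and the labels 'Negative' and 'Positive' are arbitrary choices for illustration, and the assumption that 0 maps to 'Negative' is mine, not something the original data dictates.

```r
# Simulate data shaped like the question's: nine numeric predictors
# (columns V1..V9) and a binary 0/1 outcome.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- as.data.frame(matrix(rnorm(90), nrow = 10, ncol = 9))
outcome <- rbinom(10, 1, 0.5)

# Build the response with character labels. as.factor(outcome) would
# instead give the levels "0" and "1".
# Assumption: 0 means "Negative" and 1 means "Positive".
x$V10 <- factor(outcome, levels = c(0, 1), labels = c("Negative", "Positive"))

# Inspect the levels and check that they are syntactically valid R names.
levels(x$V10)
all(make.names(levels(x$V10)) == levels(x$V10))
```

With the response recoded this way, the rfe call from the question can be rerun on x unchanged.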
Upvotes: 1