Reputation: 28189
I have a data set with NAs sprinkled generously throughout. In addition, it has columns that need to be factors.
I am using the rfe() function from the caret package to select variables. It seems that the functions= argument in rfe() works with lmFuncs for data with NAs but NOT for factor variables, while rfFuncs works for factor variables but NOT for NAs.
Any suggestions for dealing with this? I tried model.matrix(), but it seems to just cause more problems.
Upvotes: 3
Views: 1646
Reputation: 22588
Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.
For the factor variables, model.matrix() is worth another look. It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
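To make "deal with NAs and factor variables up front" concrete, here is a minimal sketch in base R. The toy data frame, the column names (num, grp), the median imputation, and the "missing" level are all illustrative assumptions, not something from the question; pick whatever NA strategy suits your data. One thing worth knowing: with the default na.action, model.matrix() drops any row that contains an NA, which may be part of why it seemed to cause more problems.

```r
# Hypothetical toy data: one numeric column and one factor, both with NAs.
dat <- data.frame(
  num = c(1.5, NA, 3.0, 4.5, NA, 6.0),
  grp = factor(c("a", "b", NA, "a", "c", "b"))
)

# Numeric NAs: replace with the column median (one simple choice among many).
dat$num[is.na(dat$num)] <- median(dat$num, na.rm = TRUE)

# Factor NAs: recode NA as an explicit level so no rows get dropped later.
g <- as.character(dat$grp)
g[is.na(g)] <- "missing"
dat$grp <- factor(g)

# Expand the factor into dummy columns; "- 1" removes the intercept,
# so every level of grp gets its own indicator column.
x <- model.matrix(~ . - 1, data = dat)

dim(x)       # same number of rows as dat
colnames(x)  # the numeric column plus one column per factor level
```

The resulting x is an all-numeric matrix with no NAs, so it should be usable as the predictor input to rfe() with either set of helper functions, though I have not checked that against every caret version.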
Also, just in case you haven't (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points: http://cran.r-project.org/web/packages/caret/index.html
Upvotes: 4