Reputation: 5183
Say I have about 500 variables available, and I'm trying to do variable selection for my model (the response is binary).
I am planning to do some kind of correlation analysis on all the continuous variables first, then handle the categorical ones afterwards.
Since there are a lot of variables involved, I can't separate them manually.
Is there a function I can use, or maybe a package?
Upvotes: 0
Views: 6189
Reputation: 263471
Create a function that returns a logical value indicating whether the number of unique values is less than some fraction of the total number of observations; I'm picking 5%:
discreteL <- function(x) length(unique(x)) < 0.05*length(x)
Now sapply it (with negation for continuous variables) to the data.frame:
> str( iris[ , !sapply(iris, discreteL)] )
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
You could have picked a particular number, say 15, as your criterion I suppose.
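As a rough sketch of that alternative, the helper below accepts either a fraction of the rows or a fixed count as the cutoff. The name is_discrete and its arguments max_unique and frac are made up for illustration; they are not part of base R or of the function above.

```r
# Hypothetical generalisation of discreteL: a column counts as discrete when
# it has fewer distinct values than the cutoff. The cutoff is either a fixed
# count (max_unique) or, if that is NULL, a fraction of the number of rows.
is_discrete <- function(x, max_unique = NULL, frac = 0.05) {
  cutoff <- if (is.null(max_unique)) frac * length(x) else max_unique
  length(unique(x)) < cutoff
}

# Use a fixed cutoff of 15 distinct values instead of 5% of the rows.
discrete_cols <- sapply(iris, is_discrete, max_unique = 15)

# drop = FALSE keeps the result a data.frame even if only one column matches.
continuous_df <- iris[, !discrete_cols, drop = FALSE]
discrete_df   <- iris[, discrete_cols, drop = FALSE]

names(continuous_df)
names(discrete_df)
```

Either cutoff is a heuristic: an integer-coded categorical variable with many levels, or a continuous variable with heavy rounding, can land on the wrong side, so it is worth checking the borderline columns by eye.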
I should make clear that statistical theory suggests this procedure is dangerous for the purpose outlined. Just picking the variables that are most correlated with a binary response is not well supported, and many studies have shown better approaches to variable selection. So my answer only covers how to do the separation; it is not an endorsement of the overall plan that you have vaguely described.
Upvotes: 1
Reputation: 15458
You can use str(df) to see which columns are factors and which are not (df is your data frame). For example, for the iris data in R:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Or you can use lapply(iris, class):
$Sepal.Length
[1] "numeric"
$Sepal.Width
[1] "numeric"
$Petal.Length
[1] "numeric"
$Petal.Width
[1] "numeric"
$Species
[1] "factor"
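With 500 columns, reading the str or lapply output line by line is tedious. A minimal sketch of summarising it instead, assuming the first element of class() is enough to describe each column (the variable name col_class is just for illustration):

```r
# Take the first class of each column; some objects (ordered factors,
# date-times) carry more than one class, so class(x)[1] keeps one per column.
col_class <- sapply(iris, function(x) class(x)[1])

# How many columns there are of each class.
table(col_class)

# Column names grouped by class, ready to use for subsetting.
split(names(col_class), col_class)
```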
Upvotes: 1
Reputation: 23938
I'm using the iris data set available in R. Then
sapply(iris, is.factor)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
FALSE FALSE FALSE FALSE TRUE
will tell you whether your columns are factors or not. So using
iris[ ,sapply(iris, is.factor)]
you can pick factor columns only. And
iris[ ,!sapply(iris, is.factor)]
will give you the columns that are not factors. You can also use is.numeric, is.character and the other is.* functions in the same way.
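One thing to watch for: when the logical index selects a single column, `[` drops the result to a plain vector by default. A short sketch of the same split using is.numeric and drop = FALSE, so that both pieces stay data frames (the names num_cols, numeric_df and other_df are only for illustration):

```r
# Logical vector marking the numeric columns.
num_cols <- sapply(iris, is.numeric)

# drop = FALSE keeps a data.frame even when only one column is selected.
numeric_df <- iris[, num_cols, drop = FALSE]
other_df   <- iris[, !num_cols, drop = FALSE]

str(numeric_df)
str(other_df)
```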
Upvotes: 5