Reputation: 5183

Is there an easy way to separate categorical vs continuous variables into two datasets in R?

Say I have about 500 variables available, and I'm trying to do variable selection for my model (the response is binary).

I am planning on doing some kind of correlation analysis for all the continuous variables first, then handling the categorical ones afterwards.

Since there are a lot of variables involved, I can't do it manually.

Is there a function that I can use, or maybe a package?

Upvotes: 0

Answers (3)

IRTFM

Reputation: 263471

Create a function that returns TRUE when the number of unique values in a column is less than some fraction of the total number of values; I'm picking 5%:

 discreteL <- function(x) length(unique(x)) < 0.05*length(x)

Now apply it to the data.frame with sapply, negating the result to keep the continuous variables:

 > str( iris[ , !sapply(iris, discreteL)] )
'data.frame':   150 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

You could instead have picked a fixed number of unique values, say 15, as your criterion, I suppose.
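As a rough sketch of that fixed-count variant, here is one way to get both halves as separate data frames. The helper name is_discrete and its max_levels argument are made up for illustration, and 15 is just the number mentioned above, not a recommended cutoff:

```r
# Hypothetical helper: TRUE when a column has at most `max_levels` distinct values.
is_discrete <- function(x, max_levels = 15) length(unique(x)) <= max_levels

# One logical flag per column of the data frame.
discrete_cols <- sapply(iris, is_discrete, max_levels = 15)

# drop = FALSE keeps the result a data frame even if only one column is selected.
categorical_df <- iris[ , discrete_cols, drop = FALSE]
continuous_df  <- iris[ , !discrete_cols, drop = FALSE]

names(categorical_df)
names(continuous_df)
```

Note that a numeric column with few distinct values (for example a 0/1 indicator) would land in categorical_df under this rule, which may or may not be what you want.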

I should make clear that statistical theory suggests this procedure is dangerous for the purpose outlined. Just picking the variables that are most correlated with a binary response is not well-supported, and many studies have shown better approaches to variable selection. So my answer is really only about how to do the separation, not an endorsement of the overall plan that you have vaguely described.

Upvotes: 1

Metrics

Reputation: 15458

You can use str(df) to see which columns are factors and which are not (df is your data frame). For example, for the iris data set in R:

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Or you can use lapply(iris, class):

$Sepal.Length
[1] "numeric"

$Sepal.Width
[1] "numeric"

$Petal.Length
[1] "numeric"

$Petal.Width
[1] "numeric"

$Species
[1] "factor"
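With around 500 columns, reading that printed output by eye is not very practical. A minimal sketch of summarising the classes instead, assuming it is enough to look at the first class of each column (the names col_classes and by_class are just for illustration):

```r
# Take the first class of each column; ordered factors, for example, report two classes.
col_classes <- sapply(iris, function(x) class(x)[1])

# Count how many columns there are of each class.
table(col_classes)

# Column names grouped by class, usable for subsetting, e.g. iris[ , by_class$numeric].
by_class <- split(names(col_classes), col_classes)
by_class
```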

Upvotes: 1

MYaseen208

Reputation: 23938

I'm using the iris data set available in R. Then

sapply(iris, is.factor)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
       FALSE        FALSE        FALSE        FALSE         TRUE

will tell you whether your columns are factors or not. So using

iris[ , sapply(iris, is.factor), drop = FALSE]

you can pick the factor columns only (drop = FALSE keeps the result a data frame even when only one column matches, as is the case here). And

iris[ , !sapply(iris, is.factor), drop = FALSE]

will give you those columns which are not factors. You can also use is.numeric, is.character and other similar functions.
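If some of your categorical variables are stored as character rather than factor, splitting on is.numeric may be the safer test. A small sketch with a made-up data frame (the column names and values are invented for illustration only):

```r
# Made-up data for illustration only.
df <- data.frame(
  age    = c(23, 35, 41, 29),
  income = c(40.5, 52.0, 61.3, 48.9),
  region = c("north", "south", "south", "east"),
  smoker = factor(c("yes", "no", "no", "yes")),
  stringsAsFactors = FALSE
)

# TRUE for numeric (integer or double) columns, FALSE for everything else.
is_num <- sapply(df, is.numeric)

continuous_df  <- df[ , is_num, drop = FALSE]
categorical_df <- df[ , !is_num, drop = FALSE]

# Convert the non-numeric columns to factors so they are all handled the same way.
categorical_df[] <- lapply(categorical_df, as.factor)

str(continuous_df)
str(categorical_df)
```

This still treats numerically coded categories (such as 0/1 flags) as continuous, so those would need to be converted to factors beforehand.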

Upvotes: 5
