CherryGarcia
CherryGarcia

Reputation: 33

Identifying continuous and categorical data using data.table

DATA = data.table(col1 = rbinom(10,10,.5),
                  col2 = rnorm(10),
                  col3 = rbinom(10,1,.5),
                  col4 = rbinom(10,10,.5),
                  col5 = rbinom(10,1,.5))

I have a large data set where continuous variables are either of class integer or numeric and categorical variables are of class integer. Hence I am looking for other rule that would allow me to distinguish between continuous and categorical variables. All of the categorical variables are 0-1 as presented in the table.

As the data set is large I would prefer to use the data.table package.

Upvotes: 0

Views: 308

Answers (2)

R Yoda
R Yoda

Reputation: 8770

I would like to propose a more generic solution:

Why? What shall happen if you have dirty data in the categorical columns (e. g. a value > 1)?

If your dataset is big enough you could try to tabulate (table) each column and count the distinct values. A low number of distinct values is a good indicator for a categorical variable:

library(data.table)

set.seed(123)

size = 1000

data = data.table(
  col1 = rbinom(size, 10, .5),
  col2 = rnorm(size),
  col3 = rbinom(size, 1, .5),
  col4 = rbinom(size, 10, .5),
  col5 = rbinom(size, 1, .5)
)

lapply(data, function(x) length(table(x)))
# $col1
# [1] 11
# 
# $col2
# [1] 1000
# 
# $col3
# [1] 2
# 
# $col4
# [1] 11
# 
# $col5
# [1] 2

Here col2 is most likely continuous all the other columns seem to be categorical.

You could use an absolute (max. number of different values) or relative (number of different values divided by the number of observations/rows) value to split the result into continuous and categorical...

If also want to know the min and max values per column you could use:

lapply(data, function(x) c(distincts = length(table(x)), min = min(x), msx = max(x) ))

to get

$col1
distincts       min       msx 
       11         0        10 

$col2
  distincts         min         msx 
1000.000000   -2.809775    3.390371 

$col3
distincts       min       msx 
        2         0         1 

$col4
distincts       min       msx 
       11         0        10 

$col5
distincts       min       msx 
        2         0         1 

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 40161

You can do:

apply(DATA, 2, function(x) ifelse(all(x %in% 0:1), 1, 0))

Or:

apply(DATA, 2, function(x) all(x %in% 0:1))

Upvotes: 1

Related Questions