Reputation: 33
DATA = data.table(col1 = rbinom(10,10,.5),
col2 = rnorm(10),
col3 = rbinom(10,1,.5),
col4 = rbinom(10,10,.5),
col5 = rbinom(10,1,.5))
I have a large data set where continuous variables are either of class integer
or numeric
and categorical variables are of class integer
. Hence I am looking for other rule that would allow me to distinguish between continuous and categorical variables. All of the categorical variables are 0-1
as presented in the table.
As the data set is large I would prefer to use the data.table
package.
Upvotes: 0
Views: 308
Reputation: 8770
I would like to propose a more generic solution:
Why? What shall happen if you have dirty data in the categorical columns (e. g. a value > 1)?
If your dataset is big enough you could try to tabulate (table
) each column and count the distinct values. A low number of distinct values is a good indicator for a categorical variable:
library(data.table)
set.seed(123)
size = 1000
data = data.table(
col1 = rbinom(size, 10, .5),
col2 = rnorm(size),
col3 = rbinom(size, 1, .5),
col4 = rbinom(size, 10, .5),
col5 = rbinom(size, 1, .5)
)
lapply(data, function(x) length(table(x)))
# $col1
# [1] 11
#
# $col2
# [1] 1000
#
# $col3
# [1] 2
#
# $col4
# [1] 11
#
# $col5
# [1] 2
Here col2 is most likely continuous all the other columns seem to be categorical.
You could use an absolute (max. number of different values) or relative (number of different values divided by the number of observations/rows) value to split the result into continuous and categorical...
If also want to know the min and max values per column you could use:
lapply(data, function(x) c(distincts = length(table(x)), min = min(x), msx = max(x) ))
to get
$col1
distincts min msx
11 0 10
$col2
distincts min msx
1000.000000 -2.809775 3.390371
$col3
distincts min msx
2 0 1
$col4
distincts min msx
11 0 10
$col5
distincts min msx
2 0 1
Upvotes: 1
Reputation: 40161
You can do:
apply(DATA, 2, function(x) ifelse(all(x %in% 0:1), 1, 0))
Or:
apply(DATA, 2, function(x) all(x %in% 0:1))
Upvotes: 1