A Doe

Reputation: 301

Convert continuous variables to binary variables

I am trying to convert a continuous variable into binary indicator columns in R using the cut function. My code is:

    xyz=rnorm(20,3,1)
    xcut=cut(xyz,breaks=c(2,3))

This converts xyz to a categorical variable, but I want three binary columns named '<2', '2-3' and '>3'. For example, if xyz[1] is 1.5, the first row should be 1, 0 and 0, and I need this for all 20 values of xyz. I don't want to use for loops and if statements to build this 20x3 matrix; I could already do that by working with the numeric values of xyz directly. Is there a shorter way?

Upvotes: 1

Views: 5499

Answers (2)

Andrii

Reputation: 3043

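One solution is unsupervised discretization, which chooses the break points entirely from the observed distribution of the continuous attribute. Here are two functions that compute such break points, with a usage example. The returned vectors can be passed to cut as breaks (with include.lowest = TRUE so that the minimum value is kept):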

    # 1. Functions

    # 1.1. Equal-width discretization for a single attribute:
    #      returns k + 1 equally spaced break points spanning range(v)
    disc_width <- function(v, k = 5) {
      r <- range(v, na.rm = TRUE)
      seq(r[1], r[2], length.out = k + 1)
    }

    # 1.2. Equal-frequency discretization for a single attribute:
    #      returns the range limits plus the interior k-quantiles
    disc_freq <- function(v, k = 5) {
      v <- v[!is.na(v)]
      r <- range(v)
      f <- quantile(v, seq_len(k - 1) / k)
      # unique() drops repeated break points caused by ties
      unique(c(r[1], f, r[2]))
    }

    # 2. Usage

    # 2.1. Feature
    x <- mtcars$mpg

    # 2.2. Range of feature 'x'
    range(x)

    # 2.3. Equal-width discretization
    disc_width(x, 4)

    # 2.4. Equal-frequency discretization
    disc_freq(x, 5)
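
The functions above return break points rather than the 0/1 columns the question asks for. A minimal sketch of one way to bridge the two, assuming equal-width breaks with four bins; the helper name to_indicators is hypothetical and not part of the original answer:

```r
# Hypothetical helper: bin a numeric vector at the given break points
# and return an integer matrix with one 0/1 column per bin.
to_indicators <- function(v, breaks) {
  # include.lowest = TRUE keeps values equal to the smallest break
  bins <- cut(v, breaks = breaks, include.lowest = TRUE)
  # Compare each value's bin code with every bin index
  m <- outer(as.integer(bins), seq_len(nlevels(bins)), "==") * 1L
  colnames(m) <- levels(bins)
  m
}

x <- mtcars$mpg
# Five equally spaced break points, i.e. four equal-width bins
breaks <- seq(min(x), max(x), length.out = 5)
ind <- to_indicators(x, breaks)
head(ind)
```

Each row of ind contains a single 1, in the column of the bin that holds the corresponding value of x. Values that are NA in the input would produce NA rows.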

Upvotes: 2

akrun

Reputation: 886938

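We can use table. Cut xyz with open-ended breaks so that every value falls into a bin, then cross-tabulate the row index against the resulting factor; each row of the result has a single 1 in the column of its bin. With the default right = TRUE, a value exactly equal to 2 is placed in the "<2" bin.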

    xcut <- cut(xyz, breaks = c(-Inf, 2, 3, Inf), labels = c("<2", "2-3", ">3"))
    table(seq_along(xcut), xcut)

data

    set.seed(24)
    xyz <- rnorm(20, 3, 1)
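
The result of table is an object of class table. If a plain 0/1 matrix or a data frame with the requested column names is needed downstream, one possible sketch follows, assuming the same breaks and labels as above; the names ind and ind_df are illustrative only:

```r
set.seed(24)
xyz <- rnorm(20, 3, 1)

xcut <- cut(xyz, breaks = c(-Inf, 2, 3, Inf), labels = c("<2", "2-3", ">3"))

# One column per level: 1 where the value falls in that bin, 0 otherwise
ind <- sapply(levels(xcut), function(lv) as.integer(xcut == lv))

# check.names = FALSE keeps the non-syntactic names "<2", "2-3", ">3"
ind_df <- data.frame(ind, check.names = FALSE)
head(ind_df)
```

This avoids explicit for loops and if statements; sapply iterates over the three levels only, not over the 20 values.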

Upvotes: 4
