Reputation: 301
I am trying to convert a continuous variable into binary columns, one per category, in R with the cut function. The code is
xyz=rnorm(20,3,1)
xcut=cut(xyz,breaks=c(2,3))
This converts xyz to a categorical variable, but I want three binary columns named '<2', '2-3' and '>3'. For example, if xyz[1] is 1.5, the first row should be 1, 0 and 0, and I need this for all 20 values of xyz. I don't want to use for and if loops to create this 20x3 matrix; I could already do that directly on the numeric xyz. Is there a shorter way?
Upvotes: 1
Views: 5499
Reputation: 3043
One solution is unsupervised discretization, which is based entirely on the observed distribution of the continuous attribute. Here are two functions, each returning a vector of break points that can be passed to cut, with an example of usage:
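# 1. Functions
# 1.1. Equal-width discretization for a single attribute:
#      returns k + 1 equally spaced break points spanning the range of v
disc_width <- function(v, k = 5) {
  r <- range(v, na.rm = TRUE)
  # length.out avoids the "wrong sign in 'by'" error that seq(by = ) can hit for small k
  seq(r[1], r[2], length.out = k + 1)
}
# 1.2. Equal-frequency discretization for a single attribute:
#      returns the range of v plus its interior k-quantiles (duplicate quantiles dropped)
disc_freq <- function(v, k = 5) {
  v <- v[!is.na(v)]
  r <- range(v)
  f <- unique(quantile(v, seq_len(k - 1) / k))
  c(r[1], f, r[2])
}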
# 2. Usage
# 2.1. Feature
x <- mtcars$mpg
# 2.2. Range of feature 'x'
range(x)
# 2.3. Equal-width discretization
disc_width(x, 4)
# 2.4. Equal-frequency discretization
disc_freq(x, 5)
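The usage above stops at the break points. A minimal sketch of the remaining step, binning the values with cut, might look like the following. The equal-width breaks are recomputed inline so the snippet runs on its own, and the names breaks and bins are only illustrative.

```r
# Feature to discretize
x <- mtcars$mpg

# Five equally spaced break points, i.e. 4 equal-width bins
breaks <- seq(min(x), max(x), length.out = 5)

# include.lowest = TRUE keeps the minimum of x inside the first bin;
# without it the smallest value would fall outside (a, b] and become NA
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Number of observations per bin
table(bins)
```

Break points from an equal-frequency scheme can be passed to cut in the same way, provided they are unique.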
Upvotes: 2
Reputation: 886938
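We can use table to cross-tabulate the row index against the levels of the cut result, which gives a 20x3 table of 0/1 indicators:
set.seed(24)
xyz <- rnorm(20,3,1)
xcut <- cut(xyz, breaks = c(-Inf, 2, 3, Inf), labels = c("<2", "2-3", ">3"))
table(seq_along(xcut), xcut)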
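If a plain integer matrix is preferred over a table object, one possible alternative is to compare the factor against each of its levels. This is only a sketch under the same setup as above; the name ind is illustrative and not part of the original answer.

```r
set.seed(24)
xyz <- rnorm(20, 3, 1)

# Same three intervals and labels as in the answer above
xcut <- cut(xyz, breaks = c(-Inf, 2, 3, Inf), labels = c("<2", "2-3", ">3"))

# One 0/1 column per level; sapply names the columns after the levels
ind <- sapply(levels(xcut), function(l) as.integer(xcut == l))

head(ind)
```

Each row of ind contains exactly one 1, in the column of the interval that the corresponding value of xyz falls into.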
Upvotes: 4