Reputation: 1053

Converting Categorical Columns into Multiple Binary Columns in R

I am trying to convert a column that has categorical data ('A', 'B', or 'C') to 3 columns where 1,0,0 would be 'A'; 0,1,0 would represent 'B', etc.

I found this code online:

flags = data.frame(Reduce(cbind, 
     lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
names(flags) = levels(d$purpose)
d = cbind(d, flags)

# Include the new columns as input variables
levelnames = paste(names(flags), collapse = " + ")
neuralnet(paste("output ~ ", levelnames), d)

Converting categorical variables in R for ANN (neuralnet)

But I'm very new to R. Can anyone break down what this complicated looking code is doing?

edit:

Implementing @nongkrong's recommendations I'm running into a problem:

CSV:

X1,X2,X3
A,D,Q
B,E,R
C,F,S
B,G,T
C,H,U
A,D,Q

newData <- read.csv("new.csv")
newerData <- model.matrix(~ X1 + X2 + X3 -1, data=newData)
newerData

R Output:

  X1A X1B X1C X2E X2F X2G X2H X3R X3S X3T X3U
1   1   0   0   0   0   0   0   0   0   0   0
2   0   1   0   1   0   0   0   1   0   0   0
3   0   0   1   0   1   0   0   0   1   0   0
4   0   1   0   0   0   1   0   0   0   1   0
5   0   0   1   0   0   0   1   0   0   0   1
6   1   0   0   0   0   0   0   0   0   0   0

It works great with 1 column, but is missing X2D and X3Q. Any ideas why?

Upvotes: 1

Answers (1)

MichaelChirico

Reputation: 34703

@nongkrong is right--read ?formulas and you'll see that most functions that accept formulas as input (e.g. lm, glm, etc.) will automatically convert categorical variables (stored as factors or characters) to dummies; you can force this on non-factor numeric variables by specifying as.factor(var) in your formula.

That said, I've encountered situations where it's convenient to have created these indicators by hand anyway--e.g., a data set with an ethnicity variable where <1% of the data fit in one or several of the ethnicity codes. There are other ways to deal with this (simply delete the minority-minority observations, e.g.), but I find that varies by situation.

So, I've annotated the code for you:

flags = data.frame(Reduce(cbind, 
     lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))

Lots going on in this first line, so let's go bit-by-bit:

d$purpose==x checks each entry of d$purpose for equality to x; the result will be TRUE or FALSE (or NA if there are missing values). Multiplying by 1 (*1) forces the output to be an integer (so TRUE becomes 1 and FALSE becomes 0).

lapply applies the function in its second argument to each element of its first argument--so for each element of levels(d$purpose) (i.e., each level of d$purpose), we output a vector of 0s and 1s, where the 1s correspond to the elements of d$purpose matching the given level. The output of lapply is a list (hence l in front of apply), with one list element corresponding to each of the levels of d$purpose.

We want to get this into our data.frame, so a list isn't very useful; Reduce is what we use to back out the information from the list to a data.frame form. Reduce(cbind,LIST) is the same as cbind(LIST[[1]],LIST[[2]],LIST[[3]],...)--convenient shorthand, especially when we don't know the length of LIST.

Wrapping this in data.frame casts this into the mode data.frame.

#This line simply puts column names on each of the indicator variables
#  Note that you can replace the RHS of this line with whatever 
#  naming convention you want for the levels--a common approach might
#  be to specify paste0(levels(d$purpose),"_flag"), e.g.
names(flags) = levels(d$purpose)
#this line adds all the indicator variables to the original 
#  data.frame
d = cbind(d, flags)
#this creates a string of the form "level1 + level2 + ... + leveln"
levelnames = paste(names(flags), collapse = " + ")
#finally we create a formula of the form y~x+d1+d2+d3
#  where each of the d* is a dummy for a level of the categorical variable
neuralnet(paste("output ~ ", levelnames), d)

Also note that something like this could have been done much simpler in the data.table package:

library(data.table)
setDT(d)
l = levels(purpose)
d[ , (l) := lapply(l, function(x) as.integer(purpose == x))]
d[ , neuralnet(paste0("output~", paste0(l, collapse = "+"))]

Upvotes: 2

Converting Categorical Columns into Multiple Binary Columns in R

Answers (1)

Related Questions