I am trying to convert a column that has categorical data ('A', 'B', or 'C') to 3 columns where 1,0,0 would be 'A'; 0,1,0 would represent 'B', etc.
I found this code online:
flags = data.frame(Reduce(cbind,
lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
names(flags) = levels(d$purpose)
d = cbind(d, flags)
# Include the new columns as input variables
levelnames = paste(names(flags), collapse = " + ")
neuralnet(paste("output ~ ", levelnames), d)
Converting categorical variables in R for ANN (neuralnet)
But I'm very new to R. Can anyone break down what this complicated-looking code is doing?
edit:
After implementing @nongkrong's recommendations, I'm running into a problem:
CSV:
X1,X2,X3
A,D,Q
B,E,R
C,F,S
B,G,T
C,H,U
A,D,Q
R:
newData <- read.csv("new.csv")
newerData <- model.matrix(~ X1 + X2 + X3 -1, data=newData)
newerData
R Output:
  X1A X1B X1C X2E X2F X2G X2H X3R X3S X3T X3U
1   1   0   0   0   0   0   0   0   0   0   0
2   0   1   0   1   0   0   0   1   0   0   0
3   0   0   1   0   1   0   0   0   1   0   0
4   0   1   0   0   0   1   0   0   0   1   0
5   0   0   1   0   0   0   1   0   0   0   1
6   1   0   0   0   0   0   0   0   0   0   0
It works great with 1 column, but is missing X2D and X3Q. Any ideas why?
@nongkrong is right: read ?formula and you'll see that most functions that accept formulas as input (e.g. lm, glm, etc.) will automatically convert categorical variables (stored as factors or characters) to dummies; you can force this on non-factor numeric variables by specifying as.factor(var) in your formula.
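As a rough sketch of that automatic conversion, model.matrix (the function lm and glm rely on to build their design matrices) can be called directly. The toy data frame below is a hypothetical stand-in for the CSV in the question, built with stringsAsFactors = TRUE so the columns are factors. With the default treatment contrasts, only the first factor keeps all of its levels once the intercept is dropped with -1; every later factor loses its first level as the reference category, which looks like the reason X2D and X3Q are missing. One way to keep every level, assuming that is what you want, is to pass identity contrasts through contrasts.arg:

```r
# Hypothetical toy data mirroring the CSV in the question
newData <- data.frame(
  X1 = c("A", "B", "C", "B", "C", "A"),
  X2 = c("D", "E", "F", "G", "H", "D"),
  X3 = c("Q", "R", "S", "T", "U", "Q"),
  stringsAsFactors = TRUE
)

# Default treatment contrasts: X2 and X3 each lose their first level
default_mm <- model.matrix(~ X1 + X2 + X3 - 1, data = newData)

# Identity contrasts (one column per level) for every factor
full_contrasts <- lapply(newData, contrasts, contrasts = FALSE)
full_mm <- model.matrix(~ X1 + X2 + X3 - 1, data = newData,
                        contrasts.arg = full_contrasts)

colnames(default_mm)  # no X2D or X3Q
colnames(full_mm)     # every level of every factor
```

Note that the full set of indicators is perfectly collinear, so this form is usually not what you want for lm or glm, although it is a common input encoding for a neural network.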
That said, I've encountered situations where it's convenient to have created these indicators by hand anyway--e.g., a data set with an ethnicity variable where <1% of the data fit in one or several of the ethnicity codes. There are other ways to deal with this (simply delete the minority-minority observations, e.g.), but I find that varies by situation.
So, I've annotated the code for you:
flags = data.frame(Reduce(cbind,
lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
Lots going on in this first line, so let's go bit-by-bit:
d$purpose == x checks each entry of d$purpose for equality to x; the result will be TRUE or FALSE (or NA if there are missing values). Multiplying by 1 (*1) converts the logical result to a number (so TRUE becomes 1 and FALSE becomes 0).
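To see the comparison step in isolation, here is a small sketch; the purpose vector is a made-up stand-in for d$purpose. Strictly speaking, *1 yields a numeric (double) vector, while as.integer would give a true integer vector; either works as a 0/1 flag.

```r
# Hypothetical stand-in for d$purpose, including one missing value
purpose <- factor(c("car", "house", "car", NA, "school"))

is_car <- purpose == "car"     # logical: TRUE FALSE TRUE NA FALSE
flag   <- is_car * 1           # numeric: 1 0 1 NA 0
flag_i <- as.integer(is_car)   # same values, stored as integer

print(flag)
```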
lapply applies the function in its second argument to each element of its first argument, so for each element of levels(d$purpose) (i.e., each level of d$purpose), we output a vector of 0s and 1s, where the 1s correspond to the elements of d$purpose matching the given level. The output of lapply is a list (hence the l in front of apply), with one list element corresponding to each of the levels of d$purpose.
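A minimal sketch of the lapply step on its own, again with a made-up purpose vector in place of d$purpose:

```r
# Hypothetical stand-in for d$purpose
purpose <- factor(c("car", "house", "car", "school"))

# One 0/1 vector per level; levels() returns them in sorted order here
flag_list <- lapply(levels(purpose), function(x) (purpose == x) * 1)

str(flag_list)  # a list with one numeric vector per level
```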
We want to get this into our data.frame, so a list isn't very useful; Reduce is what we use to get the information out of the list and into a form that fits a data.frame. Reduce(cbind, LIST) gives the same result as cbind(LIST[[1]], LIST[[2]], LIST[[3]], ...), a convenient shorthand, especially when we don't know the length of LIST.
Wrapping this in data.frame converts the resulting matrix into an object of class data.frame.
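The Reduce step can be checked on a small hand-written list (the values and column names below are illustrative only). do.call(cbind, ...) is shown as an alternative that should produce the same values; it is not part of the original code.

```r
# Hypothetical list of indicator vectors, as lapply would return
flag_list <- list(c(1, 0, 1, 0), c(0, 1, 0, 0), c(0, 0, 0, 1))

m1 <- Reduce(cbind, flag_list)                               # pairwise cbind
m2 <- cbind(flag_list[[1]], flag_list[[2]], flag_list[[3]])  # written out
m3 <- do.call(cbind, flag_list)                              # alternative idiom

# Convert the matrix to a data.frame and name the columns explicitly,
# since the automatically generated names are not meaningful
flags <- data.frame(m1)
names(flags) <- c("car", "house", "school")
print(flags)
```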
# This line simply puts column names on each of the indicator variables.
# Note that you can replace the RHS of this line with whatever
# naming convention you want for the levels; a common approach might
# be to specify paste0(levels(d$purpose), "_flag"), e.g.
names(flags) = levels(d$purpose)
# This line adds all the indicator variables to the original
# data.frame
d = cbind(d, flags)
# This creates a string of the form "level1 + level2 + ... + leveln"
levelnames = paste(names(flags), collapse = " + ")
# Finally we create a formula of the form output ~ d1 + d2 + d3,
# where each of the d* is a dummy for a level of the categorical variable.
# as.formula converts the string explicitly, so this does not rely on
# neuralnet coercing a character string itself.
neuralnet(as.formula(paste("output ~", levelnames)), data = d)
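Also note that something like this could have been done much more simply with the data.table package:
library(data.table)
library(neuralnet)
setDT(d)
# levels() needs the column itself, so refer to it through d
l = levels(d$purpose)
# Add one 0/1 integer column per level of purpose
d[ , (l) := lapply(l, function(x) as.integer(purpose == x))]
# Build the formula from the level names and fit the network
neuralnet(as.formula(paste0("output ~ ", paste(l, collapse = " + "))), data = d)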