Reputation: 3021
I have a variable that is a factor :
$ year : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...
I would like to create 8 dummy variables, named "2003", "2004" etc that take the value 0 or 1 depending on the value that the variable "year" takes. The nearest I could come up with is
dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )
But this has the unfortunate consequences of
model.matrix
(so the above command fails due to different lengths when NA is present in the year
variable).Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason) so if anyone can suggest better ways to make the dummy variables I would be obliged.
Upvotes: 5
Views: 13163
Reputation: 375
library(caret)
provides a very simple function (dummyVars
) to create dummy variables, especially when you have more than one factor variables. But you have to make sure the target variables are factor. e.g. if your Sales$year
are numeric, you have to convert them to factor: as.factor(Sales$year)
Suppose we have the original dataset 'Sales' as follows:
year Sales Region
1 2010 3695.543 North
2 2010 9873.037 West
3 2008 3579.458 West
4 2005 2788.857 North
5 2005 2952.183 North
6 2008 7255.337 West
7 2005 5237.081 West
8 2010 8987.096 North
9 2008 5545.343 North
10 2008 1809.446 West
Now we can create two dummy variables simultaneously:
>library(lattice)
>library(ggplot2)
>library(caret)
>Salesdummy <- dummyVars(~., data = Sales, levelsOnly = TRUE)
>Sdummy <- predict(Salesdummy, Sales)
The outcome will be:
2005 2008 2010 Sales RegionNorth RegionWest
1 0 0 1 3695.543 1 0
2 0 0 1 9873.037 0 1
3 0 1 0 3579.458 0 1
4 1 0 0 2788.857 1 0
5 1 0 0 2952.183 1 0
6 0 1 0 7255.337 0 1
7 1 0 0 5237.081 0 1
8 0 0 1 8987.096 1 0
9 0 1 0 5545.343 1 0
10 0 1 0 1809.446 0 1
Upvotes: 1
Reputation: 226057
This is as concise as I could get. The na.action
option takes care of the NA
values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded, don't see any way to override it within model.matrix
...
options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
c("year",levels(dt1$year)))
As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.
year 2003 2004 2005
1 <NA> NA NA NA
2 2003 1 0 0
3 2004 0 1 0
4 2005 0 0 1
Upvotes: 2
Reputation: 5341
You could use ifelse()
which won't omit na
rows (but I guess you might not count it as being "as concise as possible"):
dt1 <- data.frame(year=factor(rep(2003:2010, 10))) # example data
dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...
head(dt1)
# year yr2003 yr2004 yr2005
# 1 2003 1 0 0
# 2 2004 0 1 0
# 3 2005 0 0 1
# 4 2006 0 0 0
# 5 2007 0 0 0
# 6 2008 0 0 0
Upvotes: 2