Reputation: 359
I need to create some unorthodox dummy variables and I am having some trouble. Essentially in my dataset each teacher can teach multiple classes. I'm building a multilevel dataset, so it is ok that there are duplicate teacher IDs.
Here is an example of the data:
#generate data
teacher.id <- c(1:5, 1:5)
class.taught <- c("ELA", "Math", "Science", "ELA", "Math", "Science", "Math", "ELA", "ELA", "Math")
# combine into data frame
dat <- data.frame(teacher.id, class.taught)
As you can see, teachers with IDs 1 and 3 each teach two different classes.
The conventional approach to creating dummy variables yields:
# example of what I have done so far
dat$teach.ELA <- ifelse(dat$class.taught == "ELA", 1, 0 )
dat$teach.MATH <- ifelse(dat$class.taught == "Math", 1, 0 )
dat$teach.SCIENCE <- ifelse(dat$class.taught == "Science", 1, 0 )
dat
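However, I would like each dummy to be 1 on every row belonging to a teacher who teaches that subject in any of their classes, not just on the row for that class. Here is how I would like the new dummy variables to look: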
desired.ELA <- c(1,0,1,1,0,1,0,1,1,0)
desired.MATH <- c(0,1,0,0,1,0,1,0,0,1)
desired.SCIENCE <- c(1,0,1,0,0,1,0,1,0,0)
dat.2 <- data.frame(dat, desired.ELA, desired.MATH, desired.SCIENCE)
dat.2
My hunch is that I need to loop through the IDs to create these, but beyond that I don't see how to get there.
Upvotes: 2
Views: 87
Reputation: 936
You can also do this using %in%:
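# TRUE for every row whose teacher teaches subject x in at least one row
# (assumes column 1 is the teacher id and column 2 is the subject)
dums <- function(dt, x){
  ix <- dt[, 2] %in% x              # rows where subject x is taught
  dt[, 1] %in% unique(dt[ix, 1])    # rows of the teachers found in those rows
}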
dums(dat, 'ELA')
dums(dat, 'Math')
dums(dat, 'Science')
This gives you TRUE/FALSE rather than 0/1 vectors, but as.integer will convert them to 0/1 if necessary.
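If you want the 0/1 columns attached to the data frame, something along these lines should work. It is only a sketch: the loop, the `subjects` vector and the `desired.*` column names are my own choices to match the question, and `dums` still assumes the id is in column 1 and the subject in column 2.

```r
# example data from the question
dat <- data.frame(
  teacher.id   = c(1:5, 1:5),
  class.taught = c("ELA", "Math", "Science", "ELA", "Math",
                   "Science", "Math", "ELA", "ELA", "Math")
)

dums <- function(dt, x) {
  ix <- dt[, 2] %in% x
  dt[, 1] %in% unique(dt[ix, 1])
}

# hypothetical helper loop: one integer column per subject
subjects <- c("ELA", "Math", "Science")
for (s in subjects) {
  dat[[paste0("desired.", toupper(s))]] <- as.integer(dums(dat, s))
}
dat
```

Because the new columns are appended at the end, the first two columns that `dums` relies on are left untouched.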
Upvotes: 3
Reputation: 38520
Here is a base R method. The idea is that you create the dummies for each teacher and then merge these onto the original data:
# get dummies for each teacher
temp <- as.data.frame(with(dat, table(teacher.id, class.taught) > 0))
temp$teacher.id <- as.integer(row.names(temp))
# merge onto dataset
merge(dat, temp, by="teacher.id")
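You could coerce the logicals to integer if it really bugged you, but R treats TRUE/FALSE as 1/0 in arithmetic anyway. Note that merge sorts the result by teacher.id by default, so the rows come back in a different order than in dat.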
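If you do want integers, one way is to convert the subject columns before merging. This is a sketch against the example data from the question; the `subjects` and `res` names are just placeholders I picked.

```r
# example data from the question
dat <- data.frame(
  teacher.id   = c(1:5, 1:5),
  class.taught = c("ELA", "Math", "Science", "ELA", "Math",
                   "Science", "Math", "ELA", "ELA", "Math")
)

# one row per teacher, one logical column per subject
temp <- as.data.frame(with(dat, table(teacher.id, class.taught) > 0))

# convert the logical subject columns to 0/1 integers
subjects <- names(temp)
temp[subjects] <- lapply(temp[subjects], as.integer)

# recover the id from the row names, then merge onto the data
temp$teacher.id <- as.integer(row.names(temp))
res <- merge(dat, temp, by = "teacher.id")  # rows are ordered by teacher.id
res
```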
Upvotes: 4
Reputation: 1991
I'd use dplyr and tidyr.
library(dplyr)
library(tidyr)
dummies <-
dat %>%
group_by(teacher.id, class.taught) %>%
summarise(is_taught = as.numeric(n() > 0)) %>%
spread(class.taught, is_taught, fill = 0)
> dummies
Source: local data frame [5 x 4]
teacher.id ELA Math Science
(int) (dbl) (dbl) (dbl)
1 1 1 0 1
2 2 0 1 0
3 3 1 0 1
4 4 1 0 0
5 5 0 1 0
You can then have them in the original data using a join.
> inner_join(dat, dummies)
Joining by: "teacher.id"
teacher.id class.taught ELA Math Science
1 1 ELA 1 0 1
2 2 Math 0 1 0
3 3 Science 1 0 1
4 4 ELA 1 0 0
5 5 Math 0 1 0
6 1 Science 1 0 1
7 2 Math 0 1 0
8 3 ELA 1 0 1
9 4 ELA 1 0 0
10 5 Math 0 1 0
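In newer tidyr releases spread() is superseded by pivot_wider(), so the same idea could be sketched as below. I am assuming a tidyr version that accepts a scalar `values_fill` (1.1.0 or later, as far as I know), and swapping the group_by/summarise step for distinct(), which should give the same one-row-per-teacher-and-subject table. The `dummies2` and `res` names are placeholders.

```r
library(dplyr)
library(tidyr)

# example data from the question
dat <- data.frame(
  teacher.id   = c(1:5, 1:5),
  class.taught = c("ELA", "Math", "Science", "ELA", "Math",
                   "Science", "Math", "ELA", "ELA", "Math")
)

dummies2 <- dat %>%
  distinct(teacher.id, class.taught) %>%   # one row per teacher/subject pair
  mutate(is_taught = 1) %>%
  pivot_wider(names_from = class.taught,
              values_from = is_taught,
              values_fill = 0)             # subjects not taught become 0

# left_join keeps the original row order of dat
res <- left_join(dat, dummies2, by = "teacher.id")
res
```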
Upvotes: 1
Reputation: 2177
Just for fun, using dplyr:
library(dplyr)
dat %>% left_join(
dat %>%
group_by(teacher.id) %>%
summarize(desired.ELA = ifelse(sum(teach.ELA), 1, 0),
desired.MATH = ifelse(sum(teach.MATH), 1, 0),
desired.SCIENCE = ifelse(sum(teach.SCIENCE), 1, 0))
)
Output:
teacher.id class.taught teach.ELA teach.MATH teach.SCIENCE desired.ELA desired.MATH desired.SCIENCE
1 1 ELA 1 0 0 1 0 1
2 2 Math 0 1 0 0 1 0
3 3 Science 0 0 1 1 0 1
4 4 ELA 1 0 0 1 0 0
5 5 Math 0 1 0 0 1 0
6 1 Science 0 0 1 1 0 1
7 2 Math 0 1 0 0 1 0
8 3 ELA 1 0 0 1 0 1
9 4 ELA 1 0 0 1 0 0
10 5 Math 0 1 0 0 1 0
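A variation that skips both the join and the intermediate teach.* columns is a grouped mutate. This is a sketch rather than a drop-in replacement: it works directly on class.taught, and `res` is just a name I chose.

```r
library(dplyr)

# example data from the question
dat <- data.frame(
  teacher.id   = c(1:5, 1:5),
  class.taught = c("ELA", "Math", "Science", "ELA", "Math",
                   "Science", "Math", "ELA", "ELA", "Math")
)

res <- dat %>%
  group_by(teacher.id) %>%
  # any() is evaluated per teacher and recycled to all of that teacher's rows
  mutate(desired.ELA     = as.integer(any(class.taught == "ELA")),
         desired.MATH    = as.integer(any(class.taught == "Math")),
         desired.SCIENCE = as.integer(any(class.taught == "Science"))) %>%
  ungroup()
res
```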
Upvotes: 2