Bae

Reputation: 97

Make dummy variables for a categorical variable

Let's say I have a data frame df as follows:

df <- data.frame(type = c("A","B","AB","O","O","B","A"))

There are 4 kinds of type here. However, in my actual data, I don't know how many kinds are in the column type. The number of dummy variables should be one less than the number of kinds in type. In this example, the number of dummy variables should be 3. My expected output looks like this:

df <- data.frame(type = c("A","B","AB","O","O","B","A"),
                 A = c(1,0,0,0,0,0,1),
                 B = c(0,1,0,0,0,1,0),
                 AB = c(0,0,1,0,0,0,0))

Here I used A, B and AB as dummy variables, but whatever I choose from type doesn't matter. Even if I don't know the values of type and the number of kinds, I somehow want to make it as dummy variables.

Upvotes: 1

Views: 217

Answers (1)

Zheyuan Li

Reputation: 73265

The number of dummy variables should be one less than the number of kinds in type.

Here I used "A", "B" and "AB" as dummy variables, but whatever I choose from type doesn't matter.

Even if I don't know the values in type and the number of kinds, I somehow want to make it as dummy variables.

This is treatment contrasts coding. First, you need a factor variable.

## option 1: if you care about the order of the dummy variables
## the 1st level is the baseline and gets no dummy variable
## I do this to match your example output with "A", "B" and "AB"
f <- factor(df$type, levels = c("O", "A", "B", "AB"))

## option 2: if you don't care, then let R automatically order levels
f <- factor(df$type)
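As a small sketch of what option 2 produces, the snippet below prints the level order R picks by default and shows one way to move a chosen baseline to the front without typing out every level. Using relevel() here is my suggestion, not part of the original answer, and f2 is just an illustrative name.

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

# Default: levels are the sorted unique values, so "A" becomes the baseline
f <- factor(df$type)
levels(f)

# Move "O" to the front so that it becomes the baseline instead;
# the remaining levels keep their relative order
f2 <- relevel(f, ref = "O")
levels(f2)
```

Whichever level comes first is the one that gets no dummy variable.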

Now, apply treatment contrasts coding.

## option 1 (recommended): using contr.treatment()
## drop = FALSE keeps a matrix even if there are only 2 levels
m <- contr.treatment(nlevels(f))[f, , drop = FALSE]

## option 2 (less efficient): using model.matrix()
m <- model.matrix(~ f)[, -1, drop = FALSE]
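A quick way to convince yourself that the two options give the same 0/1 values is to build both and compare them with the names stripped. This is only a sanity-check sketch; m1 and m2 are names I made up, and drop = FALSE is added so the result stays a matrix even with just two levels.

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))
f <- factor(df$type, levels = c("O", "A", "B", "AB"))

# Option 1: pick rows of the contrast matrix by the factor's integer codes
m1 <- contr.treatment(nlevels(f))[f, , drop = FALSE]

# Option 2: build the design matrix and drop the intercept column
m2 <- model.matrix(~ f)[, -1, drop = FALSE]

# Row/column names differ between the two, so compare the values only
same <- all(unname(m1) == unname(m2))
same
```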

Finally you want to have nice row/column names for readability.

dimnames(m) <- list(seq_along(f), levels(f)[-1])

The resulting m looks like:

#   A  B  AB
#1  1  0   0
#2  0  1   0
#3  0  0   1
#4  0  0   0
#5  0  0   0
#6  0  1   0
#7  1  0   0

This is a matrix. If you want a data frame, do data.frame(m).
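To get something shaped like the expected output in the question (the original column plus the dummy columns), one possibility is to bind the dummies back onto df. The sketch below assumes the default level order, so the baseline is "A" rather than "O"; out is a name I chose for illustration.

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

# Let R choose the levels, since the values are not known in advance
f <- factor(df$type)

# One column per non-baseline level; drop = FALSE guards the 2-level case
m <- contr.treatment(nlevels(f))[f, , drop = FALSE]
dimnames(m) <- list(seq_along(f), levels(f)[-1])

# Attach the dummy columns to the original data frame
out <- cbind(df, as.data.frame(m))
out
```

Note that this still needs at least two distinct values in type, because contr.treatment() is not defined for a single level.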

Upvotes: 2
