Reputation: 31
I'm looking for a way to generate dummy variables that separate given categories into all possible grouping combinations. For example, if we have three categories (say A, B and C), there are five possible groupings:
Three groups: A / B / C
Two groups: A&B / C
Two groups: A&C / B
Two groups: A / B&C
One group: A&B&C
The dummy variable for each grouping would then be output to a separate column of a data frame, so the final output I want looks like the following table:
sample_num | category | grouping1 | grouping2 | grouping3 | grouping4 | grouping5
           |          | A; B; C   | A&B; C    | A&C; B    | A; B&C    | A&B&C
-----------+----------+-----------+-----------+-----------+-----------+----------
1          | A        | 1         | 1         | 1         | 1         | 1
2          | A        | 1         | 1         | 1         | 1         | 1
3          | A        | 1         | 1         | 1         | 1         | 1
4          | A        | 1         | 1         | 1         | 1         | 1
5          | B        | 2         | 1         | 2         | 2         | 1
6          | B        | 2         | 1         | 2         | 2         | 1
7          | B        | 2         | 1         | 2         | 2         | 1
8          | C        | 3         | 2         | 1         | 2         | 1
9          | C        | 3         | 2         | 1         | 2         | 1
10         | C        | 3         | 2         | 1         | 2         | 1
11         | C        | 3         | 2         | 1         | 2         | 1
12         | C        | 3         | 2         | 1         | 2         | 1
Upvotes: 3
Views: 2210
Reputation: 263481
The model.matrix function in the stats package (loaded by default) will construct "dummy variables", although not of the sort you describe. The first argument is an R "formula":
>dat <- read.table(text="sample_num category
+ 1 A
+ 2 A
+ 3 A
+ 4 A
+ 5 B
+ 6 B
+ 7 B
+ 8 C
+ 9 C
+ 10 C
+ 11 C
+ 12 C", header=TRUE)
> model.matrix( ~category, data=dat)
   (Intercept) categoryB categoryC
1            1         0         0
2            1         0         0
3            1         0         0
4            1         0         0
5            1         1         0
6            1         1         0
7            1         1         0
8            1         0         1
9            1         0         1
10           1         0         1
11           1         0         1
12           1         0         1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$category
[1] "contr.treatment"
I strongly suspect that your group of grouping dummies is linearly dependent, in which case the regression functions would drop at least one of the columns. Other contrast settings are possible. You should study:
?model.matrix
?contrasts
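One way to check the suspected linear dependence is to compare the number of columns in the model matrix with its rank. The sketch below is only an illustration: it rebuilds the question's data by hand, and the `grouping1` to `grouping4` columns are hand-coded assumptions taken from the question's table. `grouping5` is left out because a factor with a single level cannot be given contrasts.

```r
# Rebuild the example data; the grouping columns are hand-coded from the
# question's table (an assumption for illustration, not generated output).
dat <- data.frame(
  sample_num = 1:12,
  category   = factor(rep(c("A", "B", "C"), times = c(4, 3, 5)))
)
code <- as.integer(dat$category)            # A = 1, B = 2, C = 3

dat$grouping1 <- factor(code)               # A / B / C
dat$grouping2 <- factor(c(1, 1, 2)[code])   # A&B / C
dat$grouping3 <- factor(c(1, 2, 1)[code])   # A&C / B
dat$grouping4 <- factor(c(1, 2, 2)[code])   # A / B&C

mm <- model.matrix(~ grouping1 + grouping2 + grouping3 + grouping4, data = dat)

n_cols <- ncol(mm)       # columns requested by the formula
mm_rank <- qr(mm)$rank   # linearly independent columns among them
c(columns = n_cols, rank = mm_rank)
```

Whenever the rank is smaller than the column count, `lm()` would report `NA` coefficients for the redundant columns.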
This requests sum contrasts with no intercept; note that without the intercept the result is one indicator column per level:
> model.matrix(~category+0, data=dat, contrasts = list(category = "contr.sum"))
   categoryA categoryB categoryC
1          1         0         0
2          1         0         0
3          1         0         0
4          1         0         0
5          0         1         0
6          0         1         0
7          0         1         0
8          0         0         1
9          0         0         1
10         0         0         1
11         0         0         1
12         0         0         1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$category
[1] "contr.sum"
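For comparison, here is a small sketch of what the same contrast setting does when the intercept is kept. It rebuilds only the `category` column rather than reading the table, and it spells the argument out as `contrasts.arg`; the variable names `mm_sum` and `mm_noint` are just for this illustration.

```r
# Same categories as above: 4 x A, 3 x B, 5 x C.
dat <- data.frame(category = factor(rep(c("A", "B", "C"), times = c(4, 3, 5))))

# With the intercept kept, contr.sum yields two columns in which the last
# level (C) is coded -1 in both.
mm_sum <- model.matrix(~ category, data = dat,
                       contrasts.arg = list(category = "contr.sum"))
unique(mm_sum)

# Without the intercept, the factor is expanded to one 0/1 column per level.
mm_noint <- model.matrix(~ category + 0, data = dat,
                         contrasts.arg = list(category = "contr.sum"))
unique(mm_noint)
```

The -1 rows for level C only appear in the first matrix, which is why the no-intercept output consists purely of 0/1 indicators.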
If you want to look at automatic calculation of varying levels of interaction, you will need three variables, rather than one variable with three levels:
> dat <- expand.grid(A=letters[1:3], B=letters[4:6], C=letters[7:9])
> str(model.matrix( ~ A*B*C))
Error in str(model.matrix(~A * B * C)) :
error in evaluating the argument 'object' in selecting a method for function 'str': Error in model.frame.default(object, data, xlev = xlev) :
invalid type (closure) for variable 'C'
> str(model.matrix( ~ A*B*C, data=dat))
num [1:27, 1:27] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:27] "1" "2" "3" "4" ...
..$ : chr [1:27] "(Intercept)" "Ab" "Ac" "Be" ...
- attr(*, "assign")= int [1:27] 0 1 1 2 2 3 3 4 4 4 ...
- attr(*, "contrasts")=List of 3
..$ A: chr "contr.treatment"
..$ B: chr "contr.treatment"
..$ C: chr "contr.treatment"
> model.matrix( ~ A*B*C, data=dat)
(output omitted)
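None of the above produces the grouping columns from the question directly. As a rough base-R sketch (the helper name `all_groupings` is hypothetical, not an existing function), every grouping of n levels can be enumerated as a label vector in which the first level always gets label 1 and each later level either reuses an existing label or opens the next new one. The columns come out in a different order from the question's table: the single-group column comes first and the fully separated one last.

```r
# Enumerate all groupings (set partitions) of n levels as label vectors.
# Hypothetical helper written for this illustration.
all_groupings <- function(n) {
  out <- list()
  extend <- function(labels) {
    if (length(labels) == n) {
      out[[length(out) + 1]] <<- labels   # store a completed label vector
      return(invisible(NULL))
    }
    # Reuse any existing label, or open exactly one new group.
    for (k in seq_len(max(labels) + 1)) extend(c(labels, k))
  }
  extend(1L)
  do.call(rbind, out)   # one row per grouping, one column per level
}

dat <- data.frame(
  sample_num = 1:12,
  category   = factor(rep(c("A", "B", "C"), times = c(4, 3, 5)))
)

parts <- all_groupings(nlevels(dat$category))

# Map each observation's level to its group label, one column per grouping.
for (i in seq_len(nrow(parts))) {
  dat[[paste0("grouping", i)]] <- parts[i, as.integer(dat$category)]
}
dat
```

The number of groupings grows quickly with the number of levels (5 for three levels, 15 for four), so this is only practical for factors with few levels.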
Upvotes: 2