Reputation: 1016
I have a data set that is fully crossed with respect to two categorical variables, with only one replicate per combination:
> examp <- data.frame(group=rep(LETTERS[1:4], each=6), class=rep(LETTERS[16:21], times=4))
> table(examp$group, examp$class)
    P Q R S T U
  A 1 1 1 1 1 1
  B 1 1 1 1 1 1
  C 1 1 1 1 1 1
  D 1 1 1 1 1 1
I need to create a testing/training data set (50/50 split) which balances both group and class.
I know I can use createDataPartition from the caret package to balance it in one of the two factors, but this leaves an imbalance in the other factor:
> library(caret)
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$group, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)
    test train
  A    3     3
  B    3     3
  C    3     3
  D    3     3
> table(examp$class, examp$valid)
    test train
  P    1     3
  Q    2     2
  R    2     2
  S    2     2
  T    2     2
  U    3     1
>
>
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$class, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)
    test train
  A    3     3
  B    3     3
  C    5     1
  D    1     5
> table(examp$class, examp$valid)
    test train
  P    2     2
  Q    2     2
  R    2     2
  S    2     2
  T    2     2
  U    2     2
How can I create a partition which is balanced in both factors? If I had multiple reps per group/class combination, I would stratify by interaction(group, class), but I cannot in this case since there is only one observation in each combo.
Upvotes: 1
Views: 154
Reputation: 145775
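I propose this algorithm:

1. Randomly order the group values (e.g., DBAC)
2. For each half of the group values (e.g., first DB, then AC):
   - Randomly select half of the class values
   - Assign rows in that half of group and in the selected half of class to TRAIN
   - Assign rows in that half of group and not in the selected half of class to TEST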
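
One way the steps above might look in R is sketched below. The function name balanced_split and its arguments group_col and class_col are hypothetical. The sketch also makes one assumption beyond the steps as written: the second half of the group values reuses the class draw from the first half with TRAIN and TEST swapped, so that every class ends up evenly split as well. It assumes an even number of groups and of classes, and exactly one row per group/class combination.

```r
# Sketch only: balanced 50/50 split for a fully crossed design.
# balanced_split, group_col and class_col are illustrative names.
balanced_split <- function(dat, group_col = "group", class_col = "class") {
  # Step 1: put the group values in random order
  groups  <- sample(unique(as.character(dat[[group_col]])))
  classes <- unique(as.character(dat[[class_col]]))
  stopifnot(length(groups) %% 2 == 0, length(classes) %% 2 == 0)

  # Step 2: the first half of the ordered groups, and a random half of the classes
  first_half <- groups[seq_len(length(groups) / 2)]
  picked     <- sample(classes, length(classes) / 2)

  in_first  <- dat[[group_col]] %in% first_half
  in_picked <- dat[[class_col]] %in% picked

  # First half of groups: picked classes -> train, the rest -> test.
  # Second half of groups: mirrored (assumption, see the text above).
  ifelse(in_first == in_picked, "train", "test")
}

examp <- data.frame(group = rep(LETTERS[1:4], each = 6),
                    class = rep(LETTERS[16:21], times = 4))
set.seed(1)
examp$valid <- balanced_split(examp)
table(examp$group, examp$valid)
table(examp$class, examp$valid)
```

With the example data, each group should come out 3 test / 3 train and each class 2 test / 2 train, whichever halves are drawn. The trade-off is that mirroring makes the split less random than drawing a fresh half of the classes for every group.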
Upvotes: 1