emudrak

Reputation: 1016

Testing/Training data sets stratified on two crossed variables

I have a data set that is fully crossed with respect to two categorical variables, with only one observation per combination:

> examp <- data.frame(group=rep(LETTERS[1:4], each=6), class=rep(LETTERS[16:21], times=4))
> table(examp$group, examp$class)

    P Q R S T U
  A 1 1 1 1 1 1
  B 1 1 1 1 1 1
  C 1 1 1 1 1 1
  D 1 1 1 1 1 1

I need to split it 50/50 into training and testing sets so that both group and class are balanced.

I know I can use createDataPartition from the caret package to balance the split on one of the two factors, but this leaves an imbalance in the other factor:

> library(caret)
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$group, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)

    test train
  A    3     3
  B    3     3
  C    3     3
  D    3     3
> table(examp$class, examp$valid)

    test train
  P    1     3
  Q    2     2
  R    2     2
  S    2     2
  T    2     2
  U    3     1
> # Now stratify on class instead of group:
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$class, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)

    test train
  A    3     3
  B    3     3
  C    5     1
  D    1     5
> table(examp$class, examp$valid)

    test train
  P    2     2
  Q    2     2
  R    2     2
  S    2     2
  T    2     2
  U    2     2

How can I create a partition which is balanced in both factors? If I had multiple reps per group/class combination, I would stratify by interaction(group,class), but I cannot in this case since there is only one observation in each combo.

Upvotes: 1

Views: 154

Answers (1)

Gregor Thomas

Reputation: 145775

I propose this algorithm (it assumes an even number of groups and an even number of classes, as in your example):

  1. Randomly sort the unique group values (e.g., DBAC)
  2. Iterate over adjacent pairs of the randomly sorted group values (e.g., first DB, then AC):
    1. Randomly pick half of the class values
    2. Assign the rows with the first group and in the selected half of class to TRAIN
    3. Assign the rows with the second group and not in the selected half of class to TRAIN
    4. Assign all remaining rows of the two groups to TEST
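
The steps above can be sketched in R roughly as follows. This is only an illustration under a few assumptions: the helper name balanced_split is made up here, there is exactly one row per group/class combination, and the numbers of groups and classes are both even. Within each pair, it sends the first group's selected classes and the second group's remaining classes to train, and everything else to test.

```r
# Hypothetical helper (not part of caret): pair up groups at random and give
# the two groups in each pair complementary halves of the classes.
balanced_split <- function(df, group = "group", class = "class") {
  groups  <- sample(unique(as.character(df[[group]])))  # step 1: random order
  classes <- unique(as.character(df[[class]]))
  # Assumption: exact balance needs even counts in both factors
  stopifnot(length(groups) %% 2 == 0, length(classes) %% 2 == 0)

  valid <- rep("test", nrow(df))                 # every row starts as test
  for (i in seq(1, length(groups), by = 2)) {    # step 2: adjacent pairs
    half    <- sample(classes, length(classes) / 2)  # random half of the classes
    in_half <- df[[class]] %in% half
    # first group: selected half -> train
    valid[df[[group]] == groups[i] & in_half] <- "train"
    # second group: the other half -> train
    valid[df[[group]] == groups[i + 1] & !in_half] <- "train"
  }
  valid
}

examp <- data.frame(group = rep(LETTERS[1:4], each = 6),
                    class = rep(LETTERS[16:21], times = 4))
set.seed(1)  # only so the illustration is reproducible
examp$valid <- balanced_split(examp)
table(examp$group, examp$valid)
table(examp$class, examp$valid)
```

With the example data, every group should end up with 3 test and 3 train rows, and every class with 2 of each, whichever random pairing and halves are drawn.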

Upvotes: 1
