Corel

Reputation: 633

Stratified random sampling with different ratios for group variables in R

I made a simple example that illustrates what I want to achieve. Let's say I have this data frame:

x <- data.frame(a=1:10,b = factor(c("a","a","a","a","a","b","b","b","b","b")),
            gender = factor(c("boy","girl","boy","girl","girl","boy","boy","boy","girl","boy")))

The data frame has 10 observations. 40% girls, 60% boys. 50% a, 50% b.
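Those percentages can be checked directly. The following is only a small sketch using base R's `table()` and `prop.table()`; it repeats the data frame so it runs on its own, and the names `gender_prop`, `b_prop` and `joint_counts` are just illustrative.

```r
# Same data frame as above, repeated so this block is self-contained
x <- data.frame(
  a = 1:10,
  b = factor(c("a","a","a","a","a","b","b","b","b","b")),
  gender = factor(c("boy","girl","boy","girl","girl","boy","boy","boy","girl","boy"))
)

# Marginal proportions of each key variable
gender_prop <- prop.table(table(x$gender))
b_prop <- prop.table(table(x$b))

# Joint counts of b by gender: the cells a stratified sample would draw from
joint_counts <- table(x$b, x$gender)

print(gender_prop)
print(b_prop)
print(joint_counts)
```

The joint table shows how few rows some combinations have, which is why exact ratios are hard to keep in a small sample.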

I want to draw a random sample that preserves the ratios of selected key variables, so in this case the sample should be 40% girls and 60% boys, and also 50% a and 50% b. How can I do that? The examples I found online all assume a single common ratio for all the variables, which doesn't suit my purposes. Thanks!

Upvotes: 0

Views: 1201

Answers (1)

missuse

Reputation: 19756

As noted in the comments, for a large enough sample the ratios in the sub-samples should be similar. For smaller data sets, here is an approach:

library(tidyverse)
library(caret)

Create a group that is the interaction of the two factors and split according to that. Since your sample is very small, this cannot produce the exact proportions (no method can):

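# Label each row with the id of its b/gender combination
x %>%
  select(b, gender) %>%
  group_by(b, gender) %>%
  group_indices() -> ind

# Partition within each combination, keeping about 50% of the rows
split1 <- createDataPartition(as.factor(ind), p = 0.5)[[1]]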

table(x[split1,2])
#output
a b 
2 2 

table(x[split1,3])
#output
 boy girl 
   3    1 

With a data set twice as large:

x <- rbind(x, x)

x %>%
  select(b, gender) %>%
  group_by(b, gender) %>%
  group_indices() -> ind

split1 <- createDataPartition(as.factor(ind), p = 0.5)[[1]]

table(x[split1,2])
#output
a b 
5 5 

table(x[split1,3])
#output
 boy girl 
   6    4 

Try another ratio:

split1 <- createDataPartition(as.factor(ind), p = 0.7)[[1]]

table(x[split1,2])
#output
a b 
8 8 

table(x[split1,3])
#output
 boy girl 
   9    7
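For comparison, the same idea can be sketched in base R without caret. This is only an illustration, not what `createDataPartition` does internally: `stratified_rows` is a hypothetical helper, the per-group size is simply rounded (with at least one row per group), and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the selected rows are reproducible

# The doubled data set from above, rebuilt so this block is self-contained
x <- data.frame(
  a = 1:10,
  b = factor(c("a","a","a","a","a","b","b","b","b","b")),
  gender = factor(c("boy","girl","boy","girl","girl","boy","boy","boy","girl","boy"))
)
x <- rbind(x, x)

# Hypothetical helper: sample a fraction p of the rows within each stratum
stratified_rows <- function(strata, p) {
  rows_by_stratum <- split(seq_along(strata), strata, drop = TRUE)
  picked <- lapply(rows_by_stratum, function(rows) {
    k <- max(1, round(length(rows) * p))  # assumption: rounded size, at least 1
    rows[sample.int(length(rows), k)]     # index form avoids sample()'s length-1 quirk
  })
  sort(unlist(picked, use.names = FALSE))
}

# One stratum per b/gender combination
strata <- interaction(x$b, x$gender)
rows <- stratified_rows(strata, p = 0.5)

print(table(x$b[rows]))
print(table(x$gender[rows]))
```

On the doubled data set the four combinations have 4, 6, 8 and 2 rows, so with p = 0.5 this takes 2, 3, 4 and 1 of them: 5 a and 5 b, 6 boys and 4 girls, whichever rows happen to be drawn. With group sizes that do not divide evenly, the rounding means the ratios are only approximate.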

Upvotes: 2
