Reputation: 633
I made a simple example that illustrates what I want to achieve. Let's say I have this data frame:
x <- data.frame(a=1:10,b = factor(c("a","a","a","a","a","b","b","b","b","b")),
gender = factor(c("boy","girl","boy","girl","girl","boy","boy","boy","girl","boy")))
The data frame has 10 observations: 40% girls and 60% boys, 50% a and 50% b.
I want to draw a random sample that preserves the ratios of selected key variables. In this case the sample should contain 40% girls and 60% boys, and also 50% a and 50% b. How can I do that? The examples I found on the internet all assume a single common ratio for all the variables, which doesn't suit my purposes. Thanks!
Upvotes: 0
Views: 1201
Reputation: 19756
As noted in the comments, for a large enough sample the ratios in the sub-samples should be similar to those in the full data. For smaller data sets, here is an approach:
library(tidyverse)
library(caret)
Create a grouping variable that is the interaction of the two factors and split according to it. Since your sample is very small, this cannot reproduce the exact proportions (no method can):
x %>%
select(b, gender) %>%
group_by(b, gender) %>%
group_indices() -> ind
split1 <- createDataPartition(as.factor(ind), p = 0.5)[[1]]
table(x[split1,2])
#output
a b
2 2
table(x[split1,3])
#output
boy girl
3 1
With a data set twice as large:
x <- rbind(x, x)
x %>%
select(b, gender) %>%
group_by(b, gender) %>%
group_indices() -> ind
split1 <- createDataPartition(as.factor(ind), p = 0.5)[[1]]
table(x[split1,2])
#output
a b
5 5
table(x[split1,3])
#output
boy girl
6 4
Trying a different ratio:
split1 <- createDataPartition(as.factor(ind), p = 0.7)[[1]]
table(x[split1,2])
#output
a b
8 8
table(x[split1,3])
#output
boy girl
9 7
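For comparison, the same idea can be sketched in base R without caret. This is only an illustration under a few assumptions: the names strata, rows_by_stratum, picked and sub are made up for this example, and the rule of taking round(p * n) rows from each stratum is a simple choice of mine, not what createDataPartition does internally.

```r
# Rebuild the example data and double it, as in the second run above.
x <- data.frame(a = 1:10,
                b = factor(rep(c("a", "b"), each = 5)),
                gender = factor(c("boy", "girl", "boy", "girl", "girl",
                                  "boy", "boy", "boy", "girl", "boy")))
x <- rbind(x, x)

set.seed(1)  # only for reproducibility of which rows are picked
p <- 0.5     # fraction to draw from every stratum

# One stratum per observed combination of b and gender.
strata <- interaction(x$b, x$gender, drop = TRUE)

# Row indices grouped by stratum.
rows_by_stratum <- split(seq_len(nrow(x)), strata)

# Draw round(p * n) rows from each stratum (assumed rounding rule).
# Indexing with sample.int() avoids the sample() pitfall when a
# stratum contains a single row index.
picked <- unlist(lapply(rows_by_stratum, function(rows) {
  k <- round(p * length(rows))
  rows[sample.int(length(rows), k)]
}), use.names = FALSE)

sub <- x[picked, ]

table(sub$b)
table(sub$gender)
```

With the doubled data the four strata have 4, 6, 8 and 2 rows, so half of each is taken and the sub-sample ends up with 5 a, 5 b, 6 boys and 4 girls regardless of the seed. With stratum sizes that do not divide evenly, the rounding makes the proportions only approximate.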
Upvotes: 2