Michelle

Reputation: 1363

How do I sample specific sizes within groups?

I have a specific use case: I want to draw samples of an exact size from within each group. What method should I use to construct exact subsets based on group counts?

I am working through a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected, so I am trying to construct a sampling data frame that excludes 60% of the available subjects in each group. This sits inside a function where the user specifies the minimum proportion of subjects that must not be used, hence the `1 - .6` construction: the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.

After this code, I will be sampling completely at random, to get my final sample.

Code example:

library(dplyr)

testing <- data.frame(ID = c(seq_len(50)), Age = c(rep(18, 10), rep(19, 9), rep(20,15), rep(21,16)))

testing <- testing %>%
  slice_sample(ID, prop=1-.6)

As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. With no set seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
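For reference, the target counts quoted above are consistent with multiplying each group's size by 0.4 and rounding down. A minimal base R check, assuming that rounding down is the intended rule (the names `group_sizes`, `keep_prop`, and `targets` are illustrative only):

```r
# Group sizes from the example data: ages 18, 19, 20, 21
group_sizes <- c("18" = 10, "19" = 9, "20" = 15, "21" = 16)

# Share of each group that remains eligible (1 - 0.6)
keep_prop <- 0.4

# Assumption: fractional counts are rounded down, so that at least
# 60% of each group is always excluded
targets <- floor(group_sizes * keep_prop)
targets
sum(targets)
```

Note that these per-group targets add up to 19, one fewer than the 20 rows obtained by sampling 40% of all 50 rows at once.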

However, the overall sample size of 20 is correct.

How do I brute force the sample size within the groups to be what I need?

There are other variables in the data frame so I need to sample randomly from each age group.

EDIT: I messed up trying to give an example. In my real data I am grouping by age inside the dplyr set of commands. But neither calling group_by() on the age variable ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age nor the correct overall sample size.

I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.

That said, the answer provided and accepted shifts me away from relying on the semi_join and I think is an overall large improvement to my real code.

Upvotes: 2

Views: 2195

Answers (1)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193507

You haven't defined your grouping variable.

Try the following:

set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups:   Age [4]
#     Age     n
#   <dbl> <int>
# 1    18     4
# 2    19     3
# 3    20     6
# 4    21     6

Alternatively, try stratified from my "splitstackshape" package:

library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
#    Age N
# 1:  18 4
# 2:  19 4
# 3:  20 6
# 4:  21 6
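The two outputs above differ for age 19 (3 versus 4), which looks like a rounding difference: 9 * 0.4 = 3.6, rounded down in the first case and to the nearest integer in the second. If the rounding rule needs to be explicit, a base R sketch along the following lines may help. `sample_within_groups()` is a hypothetical helper, not part of dplyr or splitstackshape, and it assumes that "at least 60% excluded" means rounding the kept count down.

```r
# Hypothetical helper: keep floor(n * (1 - min_excluded)) randomly
# chosen rows from each group, so at least `min_excluded` of every
# group is left out.
sample_within_groups <- function(data, group_col, min_excluded) {
  keep_prop <- 1 - min_excluded
  pieces <- lapply(split(data, data[[group_col]]), function(g) {
    # Small tolerance guards against floating-point error in 1 - min_excluded
    n_keep <- floor(nrow(g) * keep_prop + 1e-9)
    g[sample.int(nrow(g), n_keep), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Same shape as the example data in the question
testing <- data.frame(ID = seq_len(50),
                      Age = rep(18:21, times = c(10, 9, 15, 16)))

set.seed(1)
eligible <- sample_within_groups(testing, "Age", min_excluded = 0.6)
table(eligible$Age)
```

Groups too small to contribute any rows simply return zero rows here, so a separate filtering step for those groups should not be needed under these assumptions.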

Upvotes: 3
