Darko

Reputation: 83

Stratified Random Sample with a 50% Selection Rate

This may be a bit of a tall order, and I'm not sure whether it goes beyond the scope of this thread, but I thought I would give it a shot.

I’m currently working on a data set that includes respondent ID (of which there are 972), Age Group, Region, Race, and Gender.

I am looking for a way to assign each respondent to either “Study 1” or “Study 2” within each demographic variable.

So for example, in the data set below, there are a total of 43 Males. I'm looking for a way to split those Males equally among each variable. If I then filter down to White, Male, from the West age 13 to 15, there are four left. I would like to randomly choose either a "Study 1" or "Study 2" grouping so that those 4 are divided evenly (2 cases are put into Study 1, and 2 cases into Study 2). I would like to do this for the rest of the cases as well. If there are an odd number of cases, I'd like to divide them up evenly-ish (so if there are 3 White Males from the Midwest aged 7 to 9, two cases would get Study 1, and the other Study 2, or vice versa).

This stratification rule needs to hold true if I use different combinations of the other filters. For example, if of those 972 respondents there are 13 Hispanic females from the South aged 7 to 9, I would need to split that group so that 7 of those respondents are in Study 1 and the remaining 6 are in Study 2.

I’m not sure if this is outside of the scope of this forum, but thought I’d check in with some experts.

I've tried using the "MOD" function in Excel, which gets me some of the way there, but it's not splitting the sample in quite the way I want.

data <- read.table(text = 
    "ID   Age    Gender     Race    Region        Desired    
370 4788  16to18   Male    Hispani    West          Study1
371 4858  4to6     Male    Hispani    Northeast     Study1
372 4863  7to9     Male    Hispani    South         Study1
373 4884  10to12   Female  Hispani    Northeast     Study1
374 4911  4to6     Female  Hispani    Northeast     Study1
375 4967  13to15   Female  Hispani    West          Study1
376 4980  4to6     Male    Hispani    South         Study1
377 5054  13to15   Male    Hispani    Midwest       Study1
378 5074  4to6     Male    Hispani    Northeast     Study2
583 930   4to6     Female  White      Northeast     Study1
584 931   7to9     Male    White      South         Study1
585 937   4to6     Male    White      South         Study1
586 938   10to12   Male    White      Midwest       Study1
587 939   13to15   Male    White      Northeast     Study1
588 941   16to18   Male    White      West          Study1
589 944   10to12   Female  White      Midwest       Study1
590 946   4to6     Male    White      Midwest       Study1
591 949   13to15   Female  White      West          Study1
592 952   16to18   Male    White      Northeast     Study1
593 953   13to15   Female  White      South         Study1
594 959   10to12   Male    White      Northeast     Study1
595 957   10to12   Female  White      South         Study1
596 961   16to18   Female  White      Midwest       Study1
597 963   13to15   Male    White      South         Study1
598 965   7to9     Male    White      Midwest       Study1
599 971   13to15   Female  White      West          Study2
600 976   13to15   Male    White      South         Study2
601 982   16to18   Female  White      Midwest       Study2
602 983   10to12   Female  White      Northeast     Study1
603 986   13to15   Male    White      West          Study1
604 992   10to12   Female  White      West          Study1
605 994   4to6     Female  White      Midwest       Study1
606 997   13to15   Male    White      West          Study2
607 999   10to12   Male    White      South         Study1
608 1013  10to12   Male    White      West          Study1
609 1011  4to6     Female  White      Northeast     Study2
610 1016  7to9     Female  White      West          Study2
611 1022  16to18   Male    White      South         Study1
612 1023  7to9     Male    White      Northeast     Study1
613 1026  16to18   Female  White      West          Study1
614 1027  7to9     Male    White      West          Study1
615 1030  4to6     Male    White      Northeast     Study1
616 1033  10to12   Female  White      Midwest       Study2
617 1034  13to15   Male    White      Midwest       Study1
618 1036  7to9     Female  White      West          Study1
619 1039  16to18   Female  White      Northeast     Study1
620 1042  16to18   Female  White      West          Study2
621 1044  10to12   Female  White      South         Study2
622 1049  13to15   Female  White      Northeast     Study1
623 1050  4to6     Female  White      South         Study1
624 1051  7to9     Male    White      South         Study2
625 1052  13to15   Male    White      Northeast     Study2
626 1053  10to12   Male    White      South         Study2
627 1054  13to15   Male    White      West          Study1
628 1055  7to9     Female  White      South         Study1
629 1058  10to12   Male    White      South         Study1
630 1061  16to18   Male    White      Midwest       Study1
631 1062  10to12   Male    White      South         Study2
632 1066  7to9     Male    White      South         Study1
633 1067  13to15   Male    White      South         Study1
634 1071  16to18   Male    White      South         Study2
635 1072  16to18   Female  White      Midwest       Study1
636 1074  10to12   Female  White      South         Study1
637 1075  10to12   Female  White      Northeast     Study2
638 1078  16to18   Female  White      Midwest       Study2
639 1080  7to9     Male    White      South         Study2
640 1083  4to6     Female  White      South         Study2
641 1093  7to9     Female  White      Midwest       Study1
642 1097  4to6     Female  White      West          Study1
643 1102  10to12   Male    White      Midwest       Study2
644 1104  13to15   Male    White      West          Study2
645 1105  7to9     Male    White      Midwest       Study2
646 1110  13to15   Male    White      Northeast     Study1
647 1113  7to9     Female  White      Midwest       Study2
648 1119  10to12   Female  White      West          Study2
649 1120  10to12   Male    White      West          Study2
650 1122  13to15   Female  White      West          Study1
651 1124  16to18   Female  White      Midwest       Study1
721 1384  7to9     Male    White      South         Study1", stringsAsFactors = FALSE, header = TRUE)

Upvotes: 3

Views: 455

Answers (2)

r2evans

Reputation: 160407

Your sample data is nice, but it doesn't provide sufficient variability to give you a spread in every combination. This may be just blind luck or an artifact of the sample you have provided. Either way, the premise of this answer doesn't change for demonstration.

I'm assuming you do not need exact matches to your Desired column, just the intent of a uniform distribution of Study within each stratum.

I'll use dplyr since I think it makes clear what each step is doing. One could use sample_frac or runif(n()) < 0.5 for it, but there is no guarantee that you'll get a uniform-ish distribution. In this implementation, I just order all rows randomly and then assign the 1 or 2 value in turn down the rows of each group. Based on this, there should never be a difference of more than 1 between study 1 and 2 within a specific combination of factors.
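As a rough illustration of why independent coin flips give no balance guarantee, here is a base-R sketch. The stratum size of 4 comes from the question's example; the variable names are made up for illustration. With runif() < 0.5 each respondent lands in Study 1 independently, so an exact 2/2 split happens only some of the time, whereas shuffling and then alternating labels always yields it.

```r
# Hypothetical stratum of 4 respondents (size taken from the question's example)
n <- 4

# Independent coin flips: the number in Study 1 is Binomial(n, 0.5),
# so the chance of an exact 2/2 split is choose(4, 2) / 2^4
p_even_split <- dbinom(n / 2, size = n, prob = 0.5)
p_even_split
# [1] 0.375

# Shuffle-then-alternate: permute the respondents, then recycle the labels 1, 2, 1, 2
shuffled_ids <- sample(n)              # random order of the 4 respondents
study <- rep(1:2, length.out = n)      # labels assigned down the shuffled order
assignment <- data.frame(id = shuffled_ids, study = study)

# The counts per study are fixed at 2 and 2; only who lands where is random
table(assignment$study)
```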

In order to demo with low n per group, I'll dumb it down to just two factors: Age and Gender.

library(dplyr)
set.seed(2) # for reproducibility only, do not include in production code

studies <- 1:2
out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender) %>%
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()

arrange(out, ID)
# # A tibble: 79 x 7
#       ID Age    Gender Race  Region    Desired Study
#    <int> <chr>  <chr>  <chr> <chr>     <chr>   <int>
#  1   930 4to6   Female White Northeast Study1      1
#  2   931 7to9   Male   White South     Study1      1
#  3   937 4to6   Male   White South     Study1      2
#  4   938 10to12 Male   White Midwest   Study1      1
#  5   939 13to15 Male   White Northeast Study1      2
#  6   941 16to18 Male   White West      Study1      1
#  7   944 10to12 Female White Midwest   Study1      1
#  8   946 4to6   Male   White Midwest   Study1      1
#  9   949 13to15 Female White West      Study1      2
# 10   952 16to18 Male   White Northeast Study1      1
# # ... with 69 more rows

One way we can see if it's working is to tabulate it. The original data:

xtabs(~ Gender + Age, data = data)
#         Age
# Gender   10to12 13to15 16to18 4to6 7to9
#   Female     10      6      8    7    5
#   Male        9     12      6    6   10

and those chosen for each study, showing equal distribution between the two studies:

xtabs(~ Study + Age + Gender, data = out)
# , , Gender = Female
#      Age
# Study 10to12 13to15 16to18 4to6 7to9
#     1      5      3      4    4    3
#     2      5      3      4    3    2
# , , Gender = Male
#      Age
# Study 10to12 13to15 16to18 4to6 7to9
#     1      5      6      3    3    5
#     2      4      6      3    3    5

And to show that there's never a difference of more than 1 within any one stratum:

group_by(out, Age, Gender) %>% summarize(differences = diff(range(table(Study))))
# # A tibble: 10 x 3
# # Groups:   Age [5]
#    Age    Gender differences
#    <chr>  <chr>        <int>
#  1 10to12 Female           0
#  2 10to12 Male             1
#  3 13to15 Female           0
#  4 13to15 Male             0
#  5 16to18 Female           0
#  6 16to18 Male             0
#  7 4to6   Female           1
#  8 4to6   Male             0
#  9 7to9   Female           1
# 10 7to9   Male             0

I repeated with up to 10 different studies, and there was never more than +/- 1 between studies within a stratum.

For your implementation where you want to preserve use of all four factors, you'll use:

out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender, Race, Region) %>%               # <--- the only difference
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()

I should add that this extends well to more than two studies (e.g., studies <- 1:3): the combined use of sample_n and rep(..., length.out=) assures that you'll never have a difference of more than 1 between the studies within each stratum.
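For a quick check of the three-study case without the original data, here is a minimal base-R sketch of the same shuffle-then-recycle idea. The toy data frame, its sizes, and names such as toy and max_gap are assumptions for illustration only, and ave() stands in for the dplyr group_by/mutate step.

```r
set.seed(42)  # only so the toy data are reproducible

# Hypothetical toy data: 60 respondents with two made-up stratification factors
toy <- data.frame(
  ID     = 1:60,
  Age    = sample(c("4to6", "7to9", "10to12"), 60, replace = TRUE),
  Gender = sample(c("Female", "Male"), 60, replace = TRUE)
)

studies <- 1:3

# Step 1: shuffle the rows
shuffled <- toy[sample(nrow(toy)), ]

# Step 2: within each Age x Gender stratum, recycle the study labels down the rows
shuffled$Study <- ave(
  seq_len(nrow(shuffled)), shuffled$Age, shuffled$Gender,
  FUN = function(i) rep(studies, length.out = length(i))
)

# Step 3: per stratum, compare the largest and smallest study counts
counts <- table(
  factor(shuffled$Study, levels = studies),
  interaction(shuffled$Age, shuffled$Gender)
)
max_gap <- max(apply(counts, 2, function(x) diff(range(x))))

# Recycling the labels means the gap within a stratum cannot exceed 1
stopifnot(max_gap <= 1)
```

Because the labels are recycled in order, any leftover rows in a stratum go to the lowest-numbered studies first; shuffling the order of studies per group would be one way to avoid that if it matters.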

Upvotes: 3

Nova

Reputation: 5861

That's a good question for this forum. And kudos on the reproducible example!

Here's one way you could approach this question. I highly recommend the tidyverse package; it's got a lot of great functions.

library(tidyverse)  # load the tidyverse library, if you don't have it, install it first

# take your data,
Study1 <- data %>% 
  # group by these variables
  group_by(Age, Gender, Race, Region) %>% 
  # sample 50 percent of each group
  sample_frac(0.5) %>% 
  # extract a vector that corresponds to the IDs of the sampled participants.
  pull(ID)

Study1  # These are all participants for study 1

# now, give each person either "Study1" or "Study2"
# If the person's ID is in the vector "Study1", make the value of a new 
# variable, "Study", equal to "Study1". If their ID is NOT in that vector, 
# then make them part of "Study2".

data <- data %>% 
  mutate(Study = ifelse(ID %in% Study1, "Study1", "Study2"))
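
One thing worth checking with this approach is how odd-sized groups are handled. As far as I know, sample_frac() takes round(group size * fraction) rows per group, and base R's round() rounds halves to the even number; treat the first part as an assumption about dplyr's internals and verify it on your version. Under that assumption, this base-R sketch shows how many members of a small group would go to each study:

```r
# Assumption: sample_frac(0.5) keeps round(n * 0.5) rows from a group of n rows
group_size <- 1:6
n_study1   <- round(group_size * 0.5)   # round() sends 0.5 -> 0, 1.5 -> 2, 2.5 -> 2
n_study2   <- group_size - n_study1

data.frame(group_size, n_study1, n_study2)
```

If that holds, the split never differs by more than 1, but a stratum with a single respondent would always end up in Study 2, which matters when grouping by all four factors leaves many small strata.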

Upvotes: 3
