Reputation: 83
I have what may be a bit of a tall order, and I'm not sure whether it goes beyond the scope of this forum, but I thought I would give it a shot.
I’m currently working on a data set that includes respondent ID (of which there are 972), Age Group, Region, Race, and Gender.
I am looking for a way to assign each respondent to either “Study 1” or “Study 2” within each demographic variable.
So for example, in the data set below, there are a total of 43 Males. I'm looking for a way to split those Males equally among each variable. If I then filter down to White, Male, from the West age 13 to 15, there are four left. I would like to randomly choose either a "Study 1" or "Study 2" grouping so that those 4 are divided evenly (2 cases are put into Study 1, and 2 cases into Study 2). I would like to do this for the rest of the cases as well. If there are an odd number of cases, I'd like to divide them up evenly-ish (so if there are 3 White Males from the Midwest aged 7 to 9, two cases would get Study 1, and the other Study 2, or vice versa).
This stratification rule needs to hold true if I use different combinations of the other filters. Say that of those 972 respondents there are 13 Hispanic females who are from the South and are age 7 to 9: I would need to split up that sample so that 7 of those respondents are in Study 1 and the remaining 6 are in Study 2.
I’m not sure if this is outside of the scope of this forum, but thought I’d check in with some experts.
I've tried using the "MOD" function in Excel, which gets me some of the way there, but it's not splitting the sample in quite the way I want.
data <- read.table(text =
"ID Age Gender Race Region Desired
370 4788 16to18 Male Hispani West Study1
371 4858 4to6 Male Hispani Northeast Study1
372 4863 7to9 Male Hispani South Study1
373 4884 10to12 Female Hispani Northeast Study1
374 4911 4to6 Female Hispani Northeast Study1
375 4967 13to15 Female Hispani West Study1
376 4980 4to6 Male Hispani South Study1
377 5054 13to15 Male Hispani Midwest Study1
378 5074 4to6 Male Hispani Northeast Study2
583 930 4to6 Female White Northeast Study1
584 931 7to9 Male White South Study1
585 937 4to6 Male White South Study1
586 938 10to12 Male White Midwest Study1
587 939 13to15 Male White Northeast Study1
588 941 16to18 Male White West Study1
589 944 10to12 Female White Midwest Study1
590 946 4to6 Male White Midwest Study1
591 949 13to15 Female White West Study1
592 952 16to18 Male White Northeast Study1
593 953 13to15 Female White South Study1
594 959 10to12 Male White Northeast Study1
595 957 10to12 Female White South Study1
596 961 16to18 Female White Midwest Study1
597 963 13to15 Male White South Study1
598 965 7to9 Male White Midwest Study1
599 971 13to15 Female White West Study2
600 976 13to15 Male White South Study2
601 982 16to18 Female White Midwest Study2
602 983 10to12 Female White Northeast Study1
603 986 13to15 Male White West Study1
604 992 10to12 Female White West Study1
605 994 4to6 Female White Midwest Study1
606 997 13to15 Male White West Study2
607 999 10to12 Male White South Study1
608 1013 10to12 Male White West Study1
609 1011 4to6 Female White Northeast Study2
610 1016 7to9 Female White West Study2
611 1022 16to18 Male White South Study1
612 1023 7to9 Male White Northeast Study1
613 1026 16to18 Female White West Study1
614 1027 7to9 Male White West Study1
615 1030 4to6 Male White Northeast Study1
616 1033 10to12 Female White Midwest Study2
617 1034 13to15 Male White Midwest Study1
618 1036 7to9 Female White West Study1
619 1039 16to18 Female White Northeast Study1
620 1042 16to18 Female White West Study2
621 1044 10to12 Female White South Study2
622 1049 13to15 Female White Northeast Study1
623 1050 4to6 Female White South Study1
624 1051 7to9 Male White South Study2
625 1052 13to15 Male White Northeast Study2
626 1053 10to12 Male White South Study2
627 1054 13to15 Male White West Study1
628 1055 7to9 Female White South Study1
629 1058 10to12 Male White South Study1
630 1061 16to18 Male White Midwest Study1
631 1062 10to12 Male White South Study2
632 1066 7to9 Male White South Study1
633 1067 13to15 Male White South Study1
634 1071 16to18 Male White South Study2
635 1072 16to18 Female White Midwest Study1
636 1074 10to12 Female White South Study1
637 1075 10to12 Female White Northeast Study2
638 1078 16to18 Female White Midwest Study2
639 1080 7to9 Male White South Study2
640 1083 4to6 Female White South Study2
641 1093 7to9 Female White Midwest Study1
642 1097 4to6 Female White West Study1
643 1102 10to12 Male White Midwest Study2
644 1104 13to15 Male White West Study2
645 1105 7to9 Male White Midwest Study2
646 1110 13to15 Male White Northeast Study1
647 1113 7to9 Female White Midwest Study2
648 1119 10to12 Female White West Study2
649 1120 10to12 Male White West Study2
650 1122 13to15 Female White West Study1
651 1124 16to18 Female White Midwest Study1
721 1384 7to9 Male White South Study1", stringsAsFactors = FALSE, header = TRUE)
Upvotes: 3
Views: 455
Reputation: 160407
Your sample data is nice, but it doesn't provide sufficient variability to give you a spread in every combination. This may be just blind luck or a result of the subset you have provided. Either way, the premise of this answer doesn't change for demonstration.
I'm assuming you do not need exact matches in your Desired column, just the intent of a uniform distribution of Study within each stratum.
I'll use dplyr since I think it makes clear what each step is doing. One could use sample_frac or runif(n()) < 0.5 for it, but there is no guarantee that you'll get a uniform-ish distribution. In this implementation, I just order all rows randomly and then, within each group, assign the 1 or 2 variable alternately down the rows. Based on this, there should never be a difference of more than 1 between study 1 and 2 within a specific combination of factors.
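To illustrate the point about runif(), here is a minimal base-R sketch. It uses a hypothetical stratum of 10 respondents rather than your data, and the variable names are mine. It compares independent coin flips with shuffled alternating labels:

```r
# Sketch only: a hypothetical stratum of 10 respondents, not the asker's data.
set.seed(1)   # for reproducibility of the demo only
n <- 10

# Independent coin flips: each draw ignores the others, so the split can drift.
coin_flip_sizes <- replicate(1000, sum(runif(n) < 0.5))
range(coin_flip_sizes)   # the Study 1 count varies from one repetition to the next

# Shuffled alternating labels: the counts are fixed, only the order is random.
balanced_sizes <- replicate(1000, sum(sample(rep(1:2, length.out = n)) == 1))
range(balanced_sizes)    # always n / 2 when n is even
```

The second approach fixes how many 1s and 2s exist before shuffling, which is why the split cannot drift.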
In order to demo with low n per group, I'll dumb it down to just two factors: Age and Gender.
library(dplyr)
set.seed(2) # for reproducibility only, do not include in production code
studies <- 1:2
out <- data %>%
  sample_n(n()) %>%                                   # shuffle all rows
  group_by(Age, Gender) %>%                           # one group per stratum
  mutate(Study = rep(studies, length.out = n())) %>%  # 1, 2, 1, 2, ... within each group
  ungroup()
arrange(out, ID)
# # A tibble: 79 x 7
#       ID Age    Gender Race  Region    Desired Study
#    <int> <chr>  <chr>  <chr> <chr>     <chr>   <int>
#  1   930 4to6   Female White Northeast Study1      1
#  2   931 7to9   Male   White South     Study1      1
#  3   937 4to6   Male   White South     Study1      2
#  4   938 10to12 Male   White Midwest   Study1      1
#  5   939 13to15 Male   White Northeast Study1      2
#  6   941 16to18 Male   White West      Study1      1
#  7   944 10to12 Female White Midwest   Study1      1
#  8   946 4to6   Male   White Midwest   Study1      1
#  9   949 13to15 Female White West      Study1      2
# 10   952 16to18 Male   White Northeast Study1      1
# # ... with 69 more rows
One way we can see if it's working is to tabulate it. The original data:
xtabs(~ Gender + Age, data = data)
#         Age
# Gender   10to12 13to15 16to18 4to6 7to9
#   Female     10      6      8    7    5
#   Male        9     12      6    6   10
and those chosen for each study, showing equal distribution between the two studies:
xtabs(~ Study + Age + Gender, data = out)
# , , Gender = Female
#      Age
# Study 10to12 13to15 16to18 4to6 7to9
#     1      5      3      4    4    3
#     2      5      3      4    3    2
# , , Gender = Male
#      Age
# Study 10to12 13to15 16to18 4to6 7to9
#     1      5      6      3    3    5
#     2      4      6      3    3    5
And to show that there's never a difference of more than 1 within any one stratum:
group_by(out, Age, Gender) %>% summarize(differences = diff(range(table(Study))))
# # A tibble: 10 x 3
# # Groups:   Age [5]
#    Age    Gender differences
#    <chr>  <chr>        <int>
#  1 10to12 Female           0
#  2 10to12 Male             1
#  3 13to15 Female           0
#  4 13to15 Male             0
#  5 16to18 Female           0
#  6 16to18 Male             0
#  7 4to6   Female           1
#  8 4to6   Male             0
#  9 7to9   Female           1
# 10 7to9   Male             0
I repeated with up to 10 different studies, and there was never a difference of more than 1 between studies within a stratum.
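A check along those lines can be sketched in base R as follows. The toy data frame and the max_gap helper are made up for illustration; only the shuffle-then-rep idea comes from the code above.

```r
# Sketch only: `toy` is made-up data standing in for the asker's `data`.
set.seed(42)
toy <- data.frame(
  ID     = 1:200,
  Age    = sample(c("4to6", "7to9", "10to12"), 200, replace = TRUE),
  Gender = sample(c("Female", "Male"), 200, replace = TRUE)
)

# Largest within-stratum gap between study sizes when there are k studies.
# (`max_gap` is a hypothetical helper, not part of any package.)
max_gap <- function(d, k) {
  d <- d[sample(nrow(d)), ]                       # shuffle the rows
  # Within each Age x Gender stratum, hand out 1..k in rotation.
  d$Study <- ave(seq_len(nrow(d)), d$Age, d$Gender,
                 FUN = function(i) rep(seq_len(k), length.out = length(i)))
  # Count every study in every stratum (zeros included), then take the spread.
  counts <- table(interaction(d$Age, d$Gender, drop = TRUE),
                  factor(d$Study, levels = seq_len(k)))
  max(apply(counts, 1, function(x) diff(range(x))))
}

gaps <- sapply(2:10, function(k) max_gap(toy, k))
stopifnot(all(gaps <= 1))   # never more than 1 apart within a stratum
```

Unlike the summarize() call above, this version also counts studies that received nobody in a stratum, so it stays strict even when a stratum has fewer members than there are studies.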
For your implementation where you want to preserve use of all four factors, you'll use:
out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender, Race, Region) %>% # <--- the only difference
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()
I should add that this extends well to more than two studies as well (e.g., studies <- 1:3): the combined use of sample_n and rep(..., length.out=) ensures that you'll never have a difference of more than 1 between the studies within each stratum.
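One caveat worth sketching: rep(studies, length.out = n()) always starts with the first study, so every odd-sized stratum gives its extra case to Study 1. With many small strata that can make Study 1 noticeably larger overall. A possible tweak, shown here in base R on hypothetical strata of size 3 (the assign_studies helper and the toy data are my own, not part of the answer's code), is to shuffle the study order per stratum. In the dplyr pipeline the analogous change would be rep(sample(studies), length.out = n()).

```r
# Sketch only: 100 hypothetical strata of 3 respondents each, not the asker's data.
set.seed(7)
toy <- data.frame(ID = 1:300, Stratum = rep(1:100, each = 3))
studies <- 1:2

# `assign_studies` is a hypothetical helper for this demo.
assign_studies <- function(d, shuffle_start) {
  d <- d[sample(nrow(d)), ]                      # random row order, as in the answer
  d$Study <- ave(seq_len(nrow(d)), d$Stratum, FUN = function(i) {
    # Optionally randomise which study comes first in this stratum.
    start <- if (shuffle_start) sample(studies) else studies
    rep(start, length.out = length(i))
  })
  d
}

fixed    <- assign_studies(toy, shuffle_start = FALSE)
shuffled <- assign_studies(toy, shuffle_start = TRUE)

table(fixed$Study)      # Study 1 gets the extra case in every stratum: 200 vs 100
table(shuffled$Study)   # the extra case can land in either study
```

Within each stratum the difference is still at most 1 either way; only the overall totals change.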
Upvotes: 3
Reputation: 5861
That's a good question for this forum. And kudos on the reproducible example!
Here's one way you could approach this question. I highly recommend the tidyverse package; it's got a lot of great functions.
library(tidyverse) # load the tidyverse library, if you don't have it, install it first
# take your data,
Study1 <- data %>%
  # group by these variables
  group_by(Age, Gender, Race, Region) %>%
  # sample 50 percent of each group
  sample_frac(0.5) %>%
  # extract a vector that corresponds to the IDs of the sampled participants.
  pull(ID)
Study1 # These are all participants for study 1
# now, give each person either "Study1" or "Study2"
# If the person's ID is in the vector "Study1", make the value of a new
# variable, "Study", equal to "Study1". If their ID is NOT in that vector,
# then make them part of "Study2".
data <- data %>%
  mutate(Study = ifelse(ID %in% Study1, "Study1", "Study2"))
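One thing to watch with odd-sized groups: as far as I can tell, sample_frac() rounds the per-group sample size, and R rounds halves to the even number. The base-R sketch below assumes the size is round(0.5 * n) and shows what that would mean for small groups. That formula is my reading of the behaviour, not something confirmed here.

```r
# Per-group Study1 size if the sampled count is round(0.5 * n).
# Assumption: this mirrors how sample_frac(0.5) picks its size.
n <- 1:7
study1_size <- round(0.5 * n)   # R rounds halves to the even number
data.frame(n, study1 = study1_size, study2 = n - study1_size)
```

Under that assumption the two studies never differ by more than 1 within a group, but a group with a single respondent would always end up in Study2, which matters when four factors leave many groups that small.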
Upvotes: 3