Reputation: 13
I have grouped data composed of students clustered within 160 schools. I would like to take a random sample of 30 schools from that dataset. I hard-coded a solution (see below), but is there a wrapper function or quicker way to do this in R? I'm after something like sample_n() or top_n(), except that those return n observations per group, whereas I want all of the observations from n groups.
library(dplyr)
library(tibble)
# First, some example data. Each row represents one student in a given school, and that student's favourite fruit.
df <- tribble(
  ~school_id, ~favourite_fruit,
  #----------#---------------
  1, "apple",
  1, "banana",
  2, "kiwi",
  2, "tomato",
  3, "strawberry",
  3, "cherry",
  4, "orange",
  4, "lime"
)
# My hard-coded solution
school_vector <- df %>%
  group_by(school_id) %>%
  select(school_id) %>%
  count() %>%
  ungroup() %>%
  select(school_id) %>%
  sample_n(2)
df_subset <- df %>%
  filter(school_id %in% school_vector$school_id) %>%
  as_tibble()
Upvotes: 1
Views: 75
Reputation: 28705
You can create a sample of school_ids within filter and use that with your current %in% logic:
df %>%
  filter(school_id %in% sample(unique(school_id), 2))
# # A tibble: 4 x 2
# school_id favourite_fruit
# <dbl> <chr>
# 1 3 strawberry
# 2 3 cherry
# 3 4 orange
# 4 4 lime
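Because sample() draws at random, the schools you get back will differ from run to run. If you need the same subset every time, one option is to call set.seed() before sampling. The snippet below is only a sketch: the seed value 42 is arbitrary, and first_draw / second_draw are names made up for illustration.

```r
library(dplyr)
library(tibble)

# Same example data as in the question
df <- tribble(
  ~school_id, ~favourite_fruit,
  1, "apple",
  1, "banana",
  2, "kiwi",
  2, "tomato",
  3, "strawberry",
  3, "cherry",
  4, "orange",
  4, "lime"
)

# Fix the seed (42 is an arbitrary choice), then sample 2 schools
set.seed(42)
first_draw <- df %>%
  filter(school_id %in% sample(unique(school_id), 2))

# Resetting the same seed reproduces the same draw
set.seed(42)
second_draw <- df %>%
  filter(school_id %in% sample(unique(school_id), 2))

# Check that both draws contain exactly the same rows
identical(first_draw, second_draw)
```

Without the set.seed() calls the two draws are independent and will usually pick different schools.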
As a function:
group_samp <- function(df, group_var, n){
  df %>%
    filter({{group_var}} %in% sample(unique({{group_var}}), n))
}
df %>%
  group_samp(school_id, 2)
# # A tibble: 4 x 2
# school_id favourite_fruit
# <dbl> <chr>
# 1 1 apple
# 2 1 banana
# 3 2 kiwi
# 4 2 tomato
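One caveat with sample(unique(x), n): when x has a single numeric value, base R's sample() treats that value as an upper bound and draws from 1:x instead. That cannot happen with 160 schools, but if you want the function to be safe in general, a possible variant samples positions with sample.int() and indexes into the ids. This is a sketch, not part of dplyr; the name group_samp_safe and the one_school data are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical variant of group_samp(); the name is an assumption
group_samp_safe <- function(df, group_var, n) {
  # Distinct group ids as a plain vector
  ids <- unique(pull(df, {{ group_var }}))
  # Sample positions rather than values, so a lone numeric id
  # is never expanded to 1:id; errors if n exceeds the number of groups
  keep <- ids[sample.int(length(ids), n)]
  df %>%
    filter({{ group_var }} %in% keep)
}

# Same example data as in the question
df <- tribble(
  ~school_id, ~favourite_fruit,
  1, "apple",
  1, "banana",
  2, "kiwi",
  2, "tomato",
  3, "strawberry",
  3, "cherry",
  4, "orange",
  4, "lime"
)

# All rows from 2 randomly chosen schools
two_schools <- df %>%
  group_samp_safe(school_id, 2)

# Edge case: a single numeric school id (made-up data)
one_school <- tibble(
  school_id = c(5, 5),
  favourite_fruit = c("fig", "pear")
)

# Always returns both rows of school 5
only_school <- one_school %>%
  group_samp_safe(school_id, 1)
```

With the original group_samp(), the one_school case would call sample(5, 1), which picks a number from 1 to 5 and so often matches no school at all.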
Upvotes: 4