Reputation: 1454
I have a data.frame and I need to extract a sample from it. For each year I want 50 observations, drawn according to population weights. Here is some example code:
library(dplyr)
set.seed(1234)
ex.df <- data.frame(value = runif(1000),
                    year = rep(1991:2010, each = 50),
                    group = sample(c("A", "B", "C"), 1000, replace = TRUE)) %>%
  # NA_real_ as the fallback keeps pop.weight numeric
  mutate(pop.weight = ifelse(group == "A", 0.5,
                      ifelse(group == "B", 0.3,
                      ifelse(group == "C", 0.2, NA_real_))))
set.seed(1234)
test <- ex.df %>%
  group_by(year) %>%
  sample_n(50, weight = pop.weight) %>%
  ungroup()
# share of each group in the sample
table(test$group)/sum(table(test$group))
A B C
0.329 0.319 0.352
Group A should make up about 50% of the sample, group B about 30%, and group C about 20%. What did I miss?
Upvotes: 1
Views: 587
Reputation: 39174
Set replace = TRUE. You want 50 observations per year, but ex.df contains only 50 observations per year. With replace = FALSE (the default), sample_n() has to return all 50 rows, just in a different order, so the weights cannot change the group proportions.
set.seed(1234)
test <- ex.df %>%
  group_by(year) %>%
  sample_n(50, weight = pop.weight, replace = TRUE) %>%
  ungroup()
table(test$group)/sum(table(test$group))
# A B C
# 0.509 0.299 0.192
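The effect of replace can be checked in isolation with base R's sample(). This is only a minimal sketch for a single hypothetical year of 50 rows; the names group, weight, idx_no_repl, idx_repl and same_rows are made up for illustration and do not come from the question.

```r
set.seed(1234)

# One hypothetical "year": 50 rows with groups assigned uniformly at random.
group  <- sample(c("A", "B", "C"), 50, replace = TRUE)
weight <- c(A = 0.5, B = 0.3, C = 0.2)[group]

# Without replacement, drawing 50 of 50 rows returns every row exactly once,
# so the result is a permutation no matter what the weights are.
idx_no_repl <- sample(seq_along(group), 50, prob = weight)
same_rows   <- identical(sort(idx_no_repl), seq_along(group))

# With replacement, rows can repeat, so the weights can shape the result.
idx_repl <- sample(seq_along(group), 50, replace = TRUE, prob = weight)

same_rows
table(group[idx_repl]) / length(idx_repl)
```

Here same_rows is TRUE, which is why the group shares in the question simply mirror the shares already present in ex.df.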
Alternatively, increase the number of observations per year in ex.df. In the following example I raise it to 5000 observations per year, and the group proportions in the resulting test look reasonable.
set.seed(1234)
# group has only 1000 values; data.frame() recycles them to fill all 100000 rows
ex.df <- data.frame(value = runif(100000),
                    year = rep(1991:2010, each = 5000),
                    group = sample(c("A", "B", "C"), 1000, replace = TRUE)) %>%
  # NA_real_ as the fallback keeps pop.weight numeric
  mutate(pop.weight = ifelse(group == "A", 0.5,
                      ifelse(group == "B", 0.3,
                      ifelse(group == "C", 0.2, NA_real_))))
set.seed(1234)
test <- ex.df %>%
  group_by(year) %>%
  sample_n(50, weight = pop.weight) %>%
  ungroup()
table(test$group)/sum(table(test$group))
# A B C
# 0.515 0.276 0.209
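One caveat that may be worth checking: the weights apply to individual rows, so the roughly 50/30/20 split appears here because the three groups are about equally common in ex.df. The sketch below uses hypothetical, deliberately unequal group sizes (10, 20 and 20 rows, an assumption for illustration only) to compute the share each group would be expected to get when drawing with replacement.

```r
# Hypothetical single year with unequal group sizes (illustration only).
group  <- rep(c("A", "B", "C"), times = c(10, 20, 20))
weight <- c(A = 0.5, B = 0.3, C = 0.2)[group]

# When drawing with replacement, each draw picks row i with probability
# weight[i] / sum(weight), so a group's expected share is the sum of its
# row weights divided by the total weight.
expected_share <- tapply(weight, group, sum) / sum(weight)
expected_share
```

With these sizes group A is expected to make up one third of the sample rather than one half, so if the groups in the real data are unbalanced, the per-row weights may need to be divided by the group sizes first.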
Upvotes: 1