I have a data frame (my.df) that I have grouped by the variable "strat". Each row has many variables; a simplified example of my.df is shown below, since the real one is quite large. What I want to do next is draw a simple random sample from each group. To draw 5 observations from each group, I would use this code:
new_df <- my.df %>% group_by(strat) %>% sample_n(5)
However, I want a different sample size for each group. The sample sizes are stored in a vector nj:
nj <- c(3, 4, 2)
So ideally, I would want 3 observations from my first stratum, 4 from my second stratum and 2 from my last stratum. Is there a way to sample by group using each group's own sample size, without writing out a separate "sample" call for every group? Thanks in advance!
my.df looks like:
var1 var2 strat
15 3 1
13 5 3
8 6 2
12 70 3
11 10 1
14 4 2
Upvotes: 2
Views: 603
Reputation: 26218
Since your example data is too small to sample from, let us use the built-in iris dataset instead:
library(tidyverse)
nj <- c(3, 5, 6)
set.seed(1)
iris %>% group_split(Species) %>% map2_df(nj, ~sample_n(.x, size = .y))
# A tibble: 14 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.6 3.1 1.5 0.2 setosa
2 4.4 3 1.3 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 6 2.7 5.1 1.6 versicolor
5 6.3 2.5 4.9 1.5 versicolor
6 5.8 2.6 4 1.2 versicolor
7 6.1 2.9 4.7 1.4 versicolor
8 5.8 2.7 4.1 1 versicolor
9 6.4 2.8 5.6 2.2 virginica
10 6.9 3.2 5.7 2.3 virginica
11 6.2 3.4 5.4 2.3 virginica
12 6.9 3.1 5.1 2.3 virginica
13 6.7 3 5.2 2.3 virginica
14 7.2 3.6 6.1 2.5 virginica
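As a minimal sketch of the same idea on data shaped like the question's: group_split() returns the pieces sorted by the grouping key, so nj has to list the sizes in that sorted order, not in the order the groups first appear. The data frame below is made up for illustration, and sample_n() raises an error if a requested size exceeds the group's row count.

```r
library(dplyr)
library(purrr)

# Made-up data: strat appears in the order 3, 1, 2 with 6, 7 and 7 rows.
set.seed(1)
my.df <- data.frame(
  var1  = sample(100, 20, replace = TRUE),
  var2  = sample(100, 20, replace = TRUE),
  strat = rep(c(3, 1, 2), times = c(6, 7, 7))
)

# Sizes for strat 1, 2, 3 (sorted key order, which is what group_split() uses).
nj <- c(3, 4, 2)

sampled <- my.df %>%
  group_split(strat) %>%                   # list of data frames, one per strat
  map2(nj, ~ sample_n(.x, size = .y)) %>%  # pair each piece with its size
  bind_rows()

# Rows drawn per stratum: 3 for strat 1, 4 for strat 2, 2 for strat 3.
table(sampled$strat)
```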
Upvotes: 1
Reputation: 193507
You can use stratified from my "splitstackshape" package.
Here's some sample data:
set.seed(1)
my.df <- data.frame(var1 = sample(100, 20, TRUE),
var2 = runif(20),
strat = sample(3, 20, TRUE))
table(my.df$strat)
#
# 1 2 3
# 5 9 6
Here's how you can use stratified:
library(splitstackshape)
# nj needs to be a named vector
nj <- c("1" = 3, "2" = 4, "3" = 2)
stratified(my.df, "strat", nj)
# var1 var2 strat
# 1: 72 0.7942399 1
# 2: 39 0.1862176 1
# 3: 50 0.6684667 1
# 4: 21 0.2672207 2
# 5: 69 0.4935413 2
# 6: 91 0.1255551 2
# 7: 78 0.4112744 2
# 8: 7 0.3403490 3
# 9: 27 0.9347052 3
table(.Last.value$strat)
#
# 1 2 3
# 3 4 2
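For comparison, the named-vector lookup can be sketched in base R without the package. This is only an illustration of the idea, not how stratified is implemented; the data frame is a deterministic stand-in with the same group sizes as the table above.

```r
# Stand-in data with 5, 9 and 6 rows in strata 1, 2 and 3 (values are made up).
set.seed(1)
my.df <- data.frame(var1  = sample(100, 20, TRUE),
                    var2  = runif(20),
                    strat = rep(1:3, times = c(5, 9, 6)))

# Sizes keyed by stratum label, so the order of the entries does not matter.
nj <- c("3" = 2, "1" = 3, "2" = 4)

pieces <- split(my.df, my.df$strat)  # named list: "1", "2", "3"

sampled <- do.call(rbind, lapply(names(pieces), function(key) {
  d <- pieces[[key]]
  k <- min(nj[[key]], nrow(d))       # never ask for more rows than exist
  d[sample(nrow(d), k), , drop = FALSE]
}))

# Rows drawn per stratum: 3, 4 and 2.
table(sampled$strat)
```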
Upvotes: 2
Reputation: 388807
You can bring the nj values into the data frame as a column and then use sample_n by group.
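library(dplyr)
# df stands for the question's data frame (my.df there)
df %>%
  mutate(nj = nj[strat]) %>%              # look up each row's sample size by its strat value
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))    # cap at the group size so sample_n does not error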
Note that the above works because strat takes the values 1, 2, 3, so it can index nj directly. For a general solution, where the group labels are not such values, you could use:
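df %>%
  # match() gives each group's position in order of first appearance,
  # so nj must list the sizes in that order
  mutate(nj = nj[match(strat, unique(strat))]) %>%
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))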
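If nj can be given names, a lookup by group label avoids relying on positions altogether. A sketch under that assumption, with made-up data, character labels, and a hypothetical helper column n_take:

```r
library(dplyr)

# Made-up data: labels appear in the order b, a, c with 5, 4 and 3 rows.
my.df <- data.frame(
  var1  = 1:12,
  strat = rep(c("b", "a", "c"), times = c(5, 4, 3))
)

# Sizes keyed by label; the order of the entries is irrelevant.
nj <- c(a = 3, b = 4, c = 2)

set.seed(1)
sampled <- my.df %>%
  mutate(n_take = unname(nj[as.character(strat)])) %>%  # per-row size by label
  group_by(strat) %>%
  sample_n(size = min(first(n_take), n())) %>%          # cap at the group size
  ungroup()

# Rows drawn per label: a = 3, b = 4, c = 2.
table(sampled$strat)
```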
Upvotes: 0