user14703201

simple random sampling from groups with specified sample size

I have a data frame (my.df) that I have grouped by the variable "strat". Each row consists of numerous variables. An example of what it looks like is below; I've simplified my.df for this example since it is quite large. What I want to do next is draw a simple random sample from each group. If I wanted to draw 5 observations from each group, I would use this code:

new_df <- my.df %>% group_by(strat) %>% sample_n(5)

However, I want a different, specified sample size for each group. I have these sample sizes in a vector nj.

nj <- c(3, 4, 2)

So ideally, I would want 3 observations from my first stratum, 4 observations from my second stratum and 2 observations from my last stratum. Is there a way to sample by group using each group's own sample size, without having to write out "sample" once per group? Thanks in advance!

my.df looks like:

var1  var2  strat
15     3     1
13     5     3
8      6     2
12     70    3
11     10    1
14     4     2

Upvotes: 2

Views: 603

Answers (3)

AnilGoyal

Reputation: 26218

Since your example data has too few rows per group to sample from, let us use the iris dataset as an example:

library(tidyverse)

nj <- c(3, 5, 6)
set.seed(1)
iris %>% group_split(Species) %>% map2_df(nj, ~sample_n(.x, size = .y))

# A tibble: 14 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          4.6         3.1          1.5         0.2 setosa    
 2          4.4         3            1.3         0.2 setosa    
 3          5.1         3.5          1.4         0.2 setosa    
 4          6           2.7          5.1         1.6 versicolor
 5          6.3         2.5          4.9         1.5 versicolor
 6          5.8         2.6          4           1.2 versicolor
 7          6.1         2.9          4.7         1.4 versicolor
 8          5.8         2.7          4.1         1   versicolor
 9          6.4         2.8          5.6         2.2 virginica 
10          6.9         3.2          5.7         2.3 virginica 
11          6.2         3.4          5.4         2.3 virginica 
12          6.9         3.1          5.1         2.3 virginica 
13          6.7         3            5.2         2.3 virginica 
14          7.2         3.6          6.1         2.5 virginica 
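
The same split-then-sample pattern can also be sketched in base R, without tidyverse. This is only an illustration: the toy my.df below is made up (20 rows with group sizes 5, 9 and 6), nj is assumed to be named by group value, and min() caps each request at the group size.

```r
set.seed(1)
# Hypothetical toy data: 20 rows, strata 1/2/3 with 5, 9 and 6 rows
my.df <- data.frame(var1 = sample(100, 20, TRUE),
                    var2 = runif(20),
                    strat = rep(1:3, times = c(5, 9, 6)))

# Sample sizes, named by stratum so the lookup does not depend on order
nj <- c("1" = 3, "2" = 4, "3" = 2)

# One data frame per stratum; the list names are the stratum values
groups <- split(my.df, my.df$strat)

# Draw rows without replacement, never asking for more rows than a group has
sampled <- Map(
  function(d, n) d[sample.int(nrow(d), min(n, nrow(d))), , drop = FALSE],
  groups, nj[names(groups)]
)

# Stack the per-stratum samples back into one data frame
new_df <- do.call(rbind, sampled)
table(new_df$strat)
```

Looking the sizes up by name (nj[names(groups)]) rather than by position keeps each size attached to its own stratum even if the groups come out in a different order.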

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1

Reputation: 193507

You can use stratified from my "splitstackshape" package.

Here's some sample data:

set.seed(1)
my.df <- data.frame(var1 = sample(100, 20, TRUE),
                    var2 = runif(20),
                    strat = sample(3, 20, TRUE))
table(my.df$strat)
# 
# 1 2 3 
# 5 9 6 

Here's how you can use stratified:

library(splitstackshape)
# nj needs to be a named vector
nj <- c("1" = 3, "2" = 4, "3" = 2)
stratified(my.df, "strat", nj)
#    var1      var2 strat
# 1:   72 0.7942399     1
# 2:   39 0.1862176     1
# 3:   50 0.6684667     1
# 4:   21 0.2672207     2
# 5:   69 0.4935413     2
# 6:   91 0.1255551     2
# 7:   78 0.4112744     2
# 8:    7 0.3403490     3
# 9:   27 0.9347052     3

table(.Last.value$strat)
# 
# 1 2 3 
# 3 4 2 
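
If nj starts out as an unnamed vector, as in the question, one way to attach the names is setNames(). This is a small sketch that assumes the sizes are listed in the sorted order of the strat values; the strat vector here just reproduces the column from the question, and nj_named is a name chosen for illustration.

```r
strat <- c(1, 3, 2, 3, 1, 2)   # strat column from the question
nj <- c(3, 4, 2)               # sizes, assumed to follow sorted stratum order

# Names are coerced to character: "1", "2", "3"
nj_named <- setNames(nj, sort(unique(strat)))
nj_named
```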

Upvotes: 2

Ronak Shah

Reputation: 388807

You can add each row's group sample size from nj as a column of the data frame and then use sample_n by group.

library(dplyr)

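# Look up each row's sample size; strat values 1, 2, 3 index nj directly
my.df %>%
  mutate(nj = nj[strat]) %>%
  group_by(strat) %>%
  # min() caps the request at the group size
  sample_n(size = min(first(nj), n()))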

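Note that the above works because strat takes the values 1, 2, 3, so it can index nj directly. For a general solution where the groups do not have such values, you could match each group to its position among the sorted unique values (this assumes nj is listed in sorted group order, which is also the order group_by uses for numeric groups):

my.df %>%
  mutate(nj = nj[match(strat, sort(unique(strat)))]) %>%
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))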
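
With a match()-based lookup it is worth checking which group each entry of nj ends up on, because match() returns positions in whatever order the lookup table has. A quick base R check using the strat column from the question: unique() keeps first-appearance order, while sort(unique()) gives the ascending order that group_by() uses for numeric groups. The names by_appearance and by_sorted are only illustrative.

```r
strat <- c(1, 3, 2, 3, 1, 2)   # strat column from the question
nj <- c(3, 4, 2)

# Lookup table in first-appearance order: 1, 3, 2
by_appearance <- nj[match(strat, unique(strat))]
by_appearance   # 3 4 2 4 3 2  -> stratum 3 is given size 4

# Lookup table in sorted order: 1, 2, 3
by_sorted <- nj[match(strat, sort(unique(strat)))]
by_sorted       # 3 2 4 2 3 4  -> stratum 3 is given size 2
```

Which of the two is right depends on how nj was written down; if the sizes were listed for strata 1, 2, 3 in that order, the sorted lookup is the one that matches.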

Upvotes: 0
