Big Dogg

Reputation: 2704

Randomly sample groups

Given a dataframe df with a column called group, how do you randomly sample k groups from it in dplyr? It should return all rows from k groups (given there are at least k unique values in df$group), and every group in df should be equally likely to be returned.

Upvotes: 27

Views: 11726

Answers (5)

Andrew C Skibo

Reputation: 11

I, too, had issues with Oscar's nest-based code, but once I updated it to the current syntax of nest(), unnest(), and slice_sample(), it worked.

Below is an alternative version that will produce the same answers if the input frame is arranged by the grouping variable; otherwise the answers will be just as good on average. This version has a couple of advantages over the nest version:

1. The final data frame keeps the columns in their original order, whereas the nest version puts the grouping variable first.
2. The intermediate results are much easier to read when you are debugging, since they are plain old lists.

I am interested in sampling the original number of groups with replacement, as in clustered bootstrapping. One could easily add more parameters to make the function more general.

# Function to compute a clustered bootstrap sample:
# split df into one data frame per group, resample the
# groups with replacement, and bind the result back together
samplebygroups <- function(df, groupvar) {
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  samplegroups <- sample(n, replace = TRUE)
  datalist[samplegroups] %>%
    bind_rows()
}
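As a sketch of the generalization mentioned above (the size and replace arguments, and the name samplebygroups2, are my own additions, not part of the original function):

```r
library(dplyr)

# Generalized version: choose how many groups to draw and whether to
# sample with replacement. The defaults reproduce the clustered
# bootstrap behavior of samplebygroups() above.
samplebygroups2 <- function(df, groupvar, size = NULL, replace = TRUE) {
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  if (is.null(size)) size <- n
  samplegroups <- sample(n, size = size, replace = replace)
  bind_rows(datalist[samplegroups])
}

# e.g. draw 2 distinct cyl groups from mtcars, without replacement
set.seed(1)
samplebygroups2(mtcars, cyl, size = 2, replace = FALSE)
```

With replace = FALSE and a size of k, this answers the original question directly: all rows from k groups, each group equally likely.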

Here is a sample run:

smallcars <- mtcars %>%
  rownames_to_column(var = "Model") %>%
  tail(5) %>%
  arrange(cyl) %>%
  select(Model, cyl, mpg)

set.seed(1000)
samplebygroups(smallcars, cyl)

with output:

# A tibble: 5 x 3
  Model            cyl   mpg
  <chr>          <dbl> <dbl>
1 Ford Pantera L     8  15.8
2 Maserati Bora      8  15  
3 Ferrari Dino       6  19.7
4 Ford Pantera L     8  15.8
5 Maserati Bora      8  15  

You would get exactly the same rows using Oscar's code, but cyl would be the first column.

Upvotes: 1

Bryan Shalloway

Reputation: 908

I really like the approach described by Tristan Mahr in his blog post. I've copied his function from the blog for the example below:

library(tidyverse)

sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}

set.seed(1234)
mpg %>% 
  sample_n_of(size = 2, model)
#> # A tibble: 12 x 11
#>    manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class 
#>    <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr> 
#>  1 audi         a6 qua~   2.8  1999     6 auto(l~ 4        15    24 p     midsi~
#>  2 audi         a6 qua~   3.1  2008     6 auto(s~ 4        17    25 p     midsi~
#>  3 audi         a6 qua~   4.2  2008     8 auto(s~ 4        16    23 p     midsi~
#>  4 ford         mustang   3.8  1999     6 manual~ r        18    26 r     subco~
#>  5 ford         mustang   3.8  1999     6 auto(l~ r        18    25 r     subco~
#>  6 ford         mustang   4    2008     6 manual~ r        17    26 r     subco~
#>  7 ford         mustang   4    2008     6 auto(l~ r        16    24 r     subco~
#>  8 ford         mustang   4.6  1999     8 auto(l~ r        15    21 r     subco~
#>  9 ford         mustang   4.6  1999     8 manual~ r        15    22 r     subco~
#> 10 ford         mustang   4.6  2008     8 manual~ r        15    23 r     subco~
#> 11 ford         mustang   4.6  2008     8 auto(l~ r        15    22 r     subco~
#> 12 ford         mustang   5.4  2008     8 manual~ r        14    20 p     subco~

Created on 2021-03-24 by the reprex package (v0.3.0)

Upvotes: 4

alexwhitworth

Reputation: 4907

Take note that the dplyr version is considerably slower than the equivalent base data frame operation:

library(dplyr)
library(microbenchmark)
microbenchmark(
  dplyr = iris %>% filter(Species %in% sample(levels(Species), 2)),
  base  = iris[iris[["Species"]] %in% sample(levels(iris[["Species"]]), 2), ]
)

Unit: microseconds
  expr     min      lq     mean  median       uq      max neval cld
 dplyr 660.287 710.655 753.6704 722.629 771.2860 1122.527   100   b
  base  83.629  95.032 110.0936 106.057 119.1715  199.949   100  a 

Note that [[ is known to be faster than $, although both work.
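To see that difference in isolation, you could benchmark the two accessors directly (a quick sketch; exact timings will vary by machine and R version):

```r
library(microbenchmark)

# Compare the two ways of extracting a column from a data frame;
# both return the same vector, only the access mechanism differs
microbenchmark(
  dollar  = iris$Species,
  bracket = iris[["Species"]]
)
```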

Upvotes: 3

Oscar

Reputation: 369

I think this approach makes the most sense if you are using dplyr:

iris_grouped <- iris %>% 
  group_by(Species) %>% 
  nest()

Which produces:

# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

with which you can then use sample_n (or its newer replacement, slice_sample):

iris_grouped %>%
  sample_n(2)

# A tibble: 2 x 2
  Species    data             
  <fct>      <list>           
1 virginica  <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
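Since the question asks for all rows of the sampled groups, you would presumably unnest afterwards. A sketch of the full pipeline under current dplyr/tidyr syntax (note that unnest() now wants the list-column named explicitly, and that nest() after group_by() returns a grouped tibble, so ungroup() is needed before sampling rows):

```r
library(dplyr)
library(tidyr)

iris %>%
  group_by(Species) %>%
  nest() %>%          # one row per Species, data in a list-column
  ungroup() %>%       # otherwise sample_n() would sample within groups
  sample_n(2) %>%     # pick 2 groups at random
  unnest(data)        # expand back to the original rows
```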

Upvotes: 12

MrFlick

Reputation: 206536

Just use sample() to choose some number of groups:

iris %>% filter(Species %in% sample(levels(Species),2))
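levels() works here because iris$Species is a factor. For a character (or numeric) grouping column, like the generic group column in the question, the same idea works with unique() instead (my adaptation of the answer above; the toy data frame is hypothetical):

```r
library(dplyr)

# Toy data: 5 groups ("a".."e") of 3 rows each
df <- tibble(group = rep(letters[1:5], each = 3), x = 1:15)

# Keep all rows from 2 randomly chosen groups;
# every group is equally likely to be picked
set.seed(42)
df %>% filter(group %in% sample(unique(group), 2))
```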

Upvotes: 42
