Reputation: 2704
Given a dataframe df with a column called group, how do you randomly sample k groups from it in dplyr? It should return all rows from k groups (given there are at least k unique values in df$group), and every group in df should be equally likely to be returned.
Upvotes: 27
Views: 11726
Reputation: 11
I too had issues with Oscar's code using nest. But when I updated to the latest syntax of nest(), unnest(), and slice_sample() it worked.
Below is an alternative version. It produces the same rows as the nest version if the input frame is arranged by the group variable; otherwise the groups drawn for a given seed may differ, but the sample is just as valid on average. This version has a couple of advantages over the nest version: (1) the final data frame keeps its columns in the original order, whereas the nest version puts the grouping variable first; (2) the intermediate results are much easier to read when you are debugging, since they are plain lists of data frames.
I am interested in sampling the original number of groups with replacement, as in clustered bootstrapping. One could easily add more parameters to make the function more general.
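library(dplyr)

# function to compute a clustered bootstrap sample
samplebygroups <- function(df, groupvar) {
  # split the data frame into a list with one data frame per group
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  # draw n group positions with replacement
  samplegroups <- sample(n, replace = TRUE)
  # stack the sampled groups back into a single data frame
  datalist[samplegroups] %>%
    bind_rows()
}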
Here is a sample run
library(tibble)  # for rownames_to_column()

smallcars <- mtcars %>%
  rownames_to_column(var = "Model") %>%
  tail(5) %>%
  arrange(cyl) %>%
  select(Model, cyl, mpg)
set.seed(1000)
samplebygroups(smallcars, cyl)
with output
# A tibble: 5 x 3
Model cyl mpg
<chr> <dbl> <dbl>
1 Ford Pantera L 8 15.8
2 Maserati Bora 8 15
3 Ferrari Dino 6 19.7
4 Ford Pantera L 8 15.8
5 Maserati Bora 8 15
You would get exactly the same rows using Oscar's code, but cyl would be the first column.
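As for adding more parameters, here is a rough sketch of one way to do it, not a tested replacement. The name sample_k_groups and the arguments k and replace are my own additions for illustration. It also tags each draw with an id, on the assumption that in a clustered bootstrap you want a group that was drawn twice to count as two separate clusters.

```r
library(dplyr)

# Hypothetical generalisation of samplebygroups(): k and replace are added
# parameters, and "draw" is an added column that identifies each sampled cluster.
sample_k_groups <- function(df, groupvar, k = NULL, replace = TRUE) {
  datalist <- df %>%
    group_by({{ groupvar }}) %>%
    group_split()
  n <- length(datalist)
  # default to the original number of groups, as in the function above
  if (is.null(k)) k <- n
  picked <- sample.int(n, size = k, replace = replace)
  # the list is unnamed, so .id numbers the draws 1..k (stored as character)
  bind_rows(datalist[picked], .id = "draw")
}

set.seed(1000)
boot <- sample_k_groups(mtcars, cyl)
```

Grouping later analysis by draw rather than by cyl keeps repeated groups apart.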
Upvotes: 1
Reputation: 908
I really like the approach described by Tristan Mahr here. I've copied his function from the blog for the example below:
library(tidyverse)
sample_n_of <- function(data, size, ...) {
  dots <- quos(...)

  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()

  sampled_groups <- sample(unique(group_ids), size)

  data %>%
    filter(group_ids %in% sampled_groups)
}
set.seed(1234)
mpg %>%
sample_n_of(size = 2, model)
#> # A tibble: 12 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a6 qua~ 2.8 1999 6 auto(l~ 4 15 24 p midsi~
#> 2 audi a6 qua~ 3.1 2008 6 auto(s~ 4 17 25 p midsi~
#> 3 audi a6 qua~ 4.2 2008 8 auto(s~ 4 16 23 p midsi~
#> 4 ford mustang 3.8 1999 6 manual~ r 18 26 r subco~
#> 5 ford mustang 3.8 1999 6 auto(l~ r 18 25 r subco~
#> 6 ford mustang 4 2008 6 manual~ r 17 26 r subco~
#> 7 ford mustang 4 2008 6 auto(l~ r 16 24 r subco~
#> 8 ford mustang 4.6 1999 8 auto(l~ r 15 21 r subco~
#> 9 ford mustang 4.6 1999 8 manual~ r 15 22 r subco~
#> 10 ford mustang 4.6 2008 8 manual~ r 15 23 r subco~
#> 11 ford mustang 4.6 2008 8 auto(l~ r 15 22 r subco~
#> 12 ford mustang 5.4 2008 8 manual~ r 14 20 p subco~
Created on 2021-03-24 by the reprex package (v0.3.0)
Upvotes: 4
Reputation: 4907
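Take note that dplyr is considerably slower than base data frame operations for this task. In the benchmark below on iris, the median is about 723 microseconds for dplyr versus 106 microseconds for base: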
library(microbenchmark)
microbenchmark(
  dplyr = iris %>% filter(Species %in% sample(levels(Species), 2)),
  base  = iris[iris[["Species"]] %in% sample(levels(iris[["Species"]]), 2), ]
)
Unit: microseconds
expr min lq mean median uq max neval cld
dplyr 660.287 710.655 753.6704 722.629 771.2860 1122.527 100 b
base 83.629 95.032 110.0936 106.057 119.1715 199.949 100 a
Note that [[ is known to be faster than $, although both work.
Upvotes: 3
Reputation: 369
I think this approach makes the most sense if you are using dplyr:
iris_grouped <- iris %>%
  group_by(Species) %>%
  nest()
Which produces:
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica <tibble [50 × 4]>
With this you can then use sample_n:
iris_grouped %>%
  sample_n(2)
# A tibble: 2 x 2
Species data
<fct> <list>
1 virginica <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
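The nested result still has one row per group, so one more step is needed to get back all rows of the sampled groups. A minimal sketch, assuming dplyr >= 1.0 (for slice_sample()) and tidyr >= 1.0 (for the nest(data = ...) and unnest(cols = ...) syntax). Nesting without group_by() is my own choice here, so that the nested frame is ungrouped when it is sampled:

```r
library(dplyr)
library(tidyr)

set.seed(42)
sampled <- iris %>%
  nest(data = -Species) %>%   # one row per Species, remaining columns packed into "data"
  slice_sample(n = 2) %>%     # pick 2 of the 3 groups, each equally likely
  unnest(cols = data)         # expand back to the original rows
```

Each species in iris has 50 rows, so sampled ends up with 100 rows from two species.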
Upvotes: 12
Reputation: 206536
Just use sample() to choose some number of groups:
iris %>% filter(Species %in% sample(levels(Species),2))
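This works because Species is a factor; levels() returns NULL for a character or numeric column. A sketch for the general case in the question, assuming the column is literally named group (sample_groups is just a hypothetical helper name):

```r
library(dplyr)

# Hypothetical helper: keep all rows belonging to k randomly chosen groups.
sample_groups <- function(df, k) {
  groups <- unique(df$group)
  stopifnot(k <= length(groups))
  # sample positions rather than values: sample(x, k) with a single
  # number x would sample from 1:x instead of from x itself
  keep <- groups[sample.int(length(groups), k)]
  df %>% filter(group %in% keep)
}

df <- data.frame(group = rep(c("a", "b", "c", "d"), each = 3), x = 1:12)
set.seed(1)
res <- sample_groups(df, 2)
```

Using unique() instead of levels() also avoids picking factor levels that have no rows in the data.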
Upvotes: 42