Aaarrrgh's My Game
Aaarrrgh's My Game

Reputation: 119

R (and dplyr?) - Sampling from a dataframe by group, up to a maximum sample size of n

I have a dataframe which contains multiple samples (1-n) per group. I would like to sample this dataset, without replacement, so that I have a maximum of 5 samples per group (1-5).

This problem has previously been described and answered here. In this question @evolvedmicrobe's answer was the most satisfactory for me and has worked in the past. This seems to have broken in the last year or so.

Here is a workable example of what I would like to do:

From mtcars, there are different numbers of rows when grouped by "cyl".

table(mtcars$cyl)
 4  6  8 
11  7 14 

I would like to create a sub-sample where the maximum number of cars per group cyl is ten. The resulting number of rows would theoretically look like:

table(subsample$cyl)
 4  6  8
10  7 10

My naive attempt at this was:

library(dplyr)
subsample <- mtcars %>% group_by(cyl) %>% sample_n(10) %>% ungroup()

However, because one group has fewer than 10 rows:

Error: size must be less or equal than 7 (size of data), set replace = TRUE to use sampling with replacement

@evolvedmicrobe's answer to this was to create a custom sampling function:

### Custom sampler function to sample min(data, sample) which can't be done with dplyr
 ### it's a modified copy of sample_n.grouped_df
 sample_vals <- function (tbl, size, replace = FALSE, weight = NULL, .env = parent.frame()) 
 {
   #assert_that(is.numeric(size), length(size) == 1, size >= 0)
   weight <- substitute(weight)
   index <- attr(tbl, "indices")
   sizes = sapply(index, function(z) min(length(z), size)) # here's my contribution
   sampled <- lapply(1:length(index), function(i) dplyr:::sample_group(index[[i]],  frac = FALSE, tbl = tbl, 
                                       size = sizes[i], replace = replace, weight = weight, .env = .env))
   idx <- unlist(sampled) + 1
   grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
 }

 samped_data = dataset %>% group_by(something) %>% sample_vals(size = 50000) %>% ungroup()

This function has worked in the past, I've just tried re-running it but it no longer works, instead, it throws back the same error as it currently does for the mtcars example:

library(dplyr)
subsample <- mtcars %>% group_by(cyl) %>% sample_vals(10) %>% ungroup()

Error in dplyr:::sample_group(index[[i]], frac = FALSE, tbl = tbl, size = sizes[i], : unused argument (tbl = tbl) Called from: FUN(X[[i]], ...)

Has anyone got a better way of sampling by group, without replacement, up to a maximum size per group? I'm not ordinarily a big user of dplyr, so all options from base R or other packages are also welcome.

Otherwise, does anyone have an idea why the previous work-around has stopped working?

Thanks for everyone's time.

Upvotes: 5

Views: 2668

Answers (4)

kath
kath

Reputation: 7724

For a simple function, you can use this workaround, which first blows up the group with not enough samples and then filters them out at the end:

library(dplyr)
library(tidyr)

size <- 10

subsample <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(group_count = n(), 
         group_count_along = 1:n()) %>% 
  ungroup() %>% 
  complete(cyl, group_count_along) %>% 
  group_by(cyl) %>% 
  filter(group_count_along <= max(group_count, size, na.rm = T)) %>% 
  sample_n(size) %>% 
  ungroup() %>% 
  filter(group_count_along <= group_count)

table(subsample$cyl)
 4  6  8 
10  7 10 

Upvotes: 1

lebatsnok
lebatsnok

Reputation: 6449

It is quite straightforward with base R as well, for example:

do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x) {
  n <- nrow(x)
  s <- min(n, 10)
  x[sample(seq_len(n), s),]
}))

The rows in the output will be sorted by cyl -- but row order would probably not matter anyway.

Upvotes: 2

Shree
Shree

Reputation: 11140

Here's a simple solution using slice -

samples_per_group <- 10

subsample <- mtcars %>%
  group_by(cyl) %>%
  slice(sample(n(), min(samples_per_group, n()))) %>%
  ungroup()

table(subsample$cyl)

#  4  6  8 
# 10  7 10

Upvotes: 6

tokami
tokami

Reputation: 156

The function sample_group has been updated and the arguments tbl and .env have been removed. Removing these arguments from your sample_vals function and getting rid of the +1 restores the functionality of your function.

require(dplyr)

sample_vals <- function (tbl, size, replace = FALSE, weight = NULL){
    ## assert_that(is.numeric(size), length(size) == 1, size >= 0)
    weight <- substitute(weight)
    index <- attr(tbl, "indices")
    sizes <- sapply(index, function(z) min(length(z), size)) # here's my contribution
    sampled <- lapply(1:length(index),
                      function(i) dplyr:::sample_group(index[[i]],  frac = FALSE, 
                                                       size = sizes[i],
                                                       replace = replace,
                                                       weight = weight))
    idx <- unlist(sampled) ## + 1
    grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}

samped_data <- mtcars %>% group_by(cyl) %>% sample_vals(size = 10) %>% ungroup()

table(samped_data$cyl)

Upvotes: 1

Related Questions