Dplyr downsample in pipeline

Question

I have a tibble like so:

tibble(a = c(1,2,3,4,5), b = c(1,1,1,2,2))

I want to randomly downsample the data by the "b" column, like so:

tibble(a = c(1,3,4,5), b = c(1,1,2,2))

How can I do this entirely in a dplyr pipeline without changing the data type of the tibble?

IceCreamToucan · Accepted Answer

This finds the smallest group size (grouping by b) and samples that many rows from each group. It's not clear whether that's what you wanted.

If your tibble is called df:

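df %>%
  group_by(b) %>%
  add_count() %>%   # n = number of rows in each group of b
  # `.` is the whole piped-in tibble, so min(.$n) is the smallest group size
  slice(sample(row_number(), min(.$n))) %>%
  ungroup() %>%
  select(-n)        # drop the helper column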
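
For comparison, here is a minimal sketch of the same idea using slice_sample(), which is available from dplyr 1.0.0 onward. It assumes the goal is to keep the same number of rows for every value of b. The names n_min and downsampled are only illustrative, and the smallest group size is computed in a separate step first, so this is not strictly a single pipeline.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, 4, 5), b = c(1, 1, 1, 2, 2))

# Smallest group size across the values of b (2 for this example data)
n_min <- min(count(df, b)$n)

downsampled <- df %>%
  group_by(b) %>%
  slice_sample(n = n_min) %>%  # draw n_min random rows from each group
  ungroup()                    # return an ungrouped tibble
```

Running count(downsampled, b) afterwards should show the same n for every value of b. Which rows are kept varies between runs unless you call set.seed() first.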
