Christopher Costello

Reputation: 1256

Downsample by group in a dplyr pipeline

I have a tibble like so:

tibble(a = c(1,2,3,4,5), b = c(1,1,1,2,2))

I want to randomly downsample the data by the "b" column, like so:

tibble(a = c(1,3,4,5), b = c(1,1,2,2))

How can I do this entirely within a dplyr pipeline, without changing the data type of the tibble?

Upvotes: 0

Views: 883

Answers (1)

IceCreamToucan

Reputation: 28675

This finds the smallest group size (grouping by b) and samples that many rows from each group, so every value of b ends up with the same number of rows. It's not clear whether that's what you wanted.

If your tibble is called df:

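library(dplyr)

df %>%
  group_by(b) %>%
  add_count() %>%                            # n = number of rows in each b group
  slice(sample(row_number(), min(.$n))) %>%  # . is the whole piped-in data, so min(.$n) is the smallest group size
  ungroup() %>%                              # drop the grouping so a plain tibble comes back
  select(-n)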
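
If you would rather not rely on the `.` placeholder, the same idea (take the smallest group size, then keep that many random rows per group) can be sketched with `filter()` instead. This is only one possible variant, not the only way to do it. The column names `group_size` and `target` and the result name `balanced` are made up for illustration and are dropped or replaceable as needed.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, 4, 5), b = c(1, 1, 1, 2, 2))

balanced <- df %>%
  add_count(b, name = "group_size") %>%       # rows per value of b (helper column)
  mutate(target = min(group_size)) %>%        # smallest group size, computed over all rows
  group_by(b) %>%
  filter(sample(n()) <= first(target)) %>%    # random permutation of 1..n(); keep `target` rows per group
  ungroup() %>%
  select(-group_size, -target)                # remove the helper columns

balanced
```

Which rows are kept changes from run to run unless you call `set.seed()` first, but each value of b always ends up with the same number of rows, and the result is an ordinary ungrouped tibble.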

Upvotes: 3
