Reputation: 1256
I have a tibble like so:
tibble(a = c(1,2,3,4,5), b = c(1,1,1,2,2))
I want to randomly downsample the data by the "b" column, like so:
tibble(a = c(1,3,4,5), b = c(1,1,2,2))
How can I do this entirely in a dplyr pipeline without changing the data type of the tibble?
Upvotes: 0
Views: 883
Reputation: 28675
This gets the smallest group size (grouped by b), and samples that many elements from each group. Not clear if that's what you wanted.
If your tibble is called df:
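library(dplyr)

df %>%
  group_by(b) %>%
  add_count() %>%                            # n = number of rows in each b group
  # `.` is the whole piped-in data, so min(.$n) is the smallest group size
  slice(sample(row_number(), min(.$n))) %>%
  ungroup() %>%                              # return a plain (ungrouped) tibble
  select(-n)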
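If you are on dplyr 1.0.0 or later, the same idea can probably be written with slice_sample(). This is only a sketch under that assumption: n_min and balanced are names I made up for illustration, and the smallest group size is computed in a separate step before the pipeline rather than inside it.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, 4, 5), b = c(1, 1, 1, 2, 2))

# Size of the smallest b group (hypothetical helper variable).
n_min <- min(count(df, b)$n)

balanced <- df %>%
  group_by(b) %>%
  slice_sample(n = n_min) %>%  # draw n_min random rows from each group
  ungroup()                    # drop the grouping so a plain tibble comes back

balanced
```

Which rows of the larger group survive is random, so call set.seed() first if you need a reproducible result.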
Upvotes: 3