Sample n random rows per group in a dataframe with dplyr when some observations have less than n rows

Question

I have a data frame with two categorical variables.

samples <- c("A", "A", "A", "A", "B", "B")
groups <- c(1, 1, 1, 2, 1, 1)
df <- data.frame(samples, groups)
df
  samples groups
1       A      1
2       A      1
3       A      1
4       A      2
5       B      1
6       B      1

For each sample-group combination, I would like to randomly downsample the data frame (the randomness is important) to a maximum of X rows, while keeping all rows for combinations that appear fewer than X times. In this example X = 2. Is there an easy way to do this? The problem is that the combination in row 4 (A, 2) appears only once, so dplyr's sample_n() would not work.

Desired output:

  samples groups
1       A      1
2       A      1
3       A      2
4       B      1
5       B      1

Ronak Shah · Accepted Answer

For each group, you can sample either x rows or the number of rows in the group, whichever is smaller:

library(dplyr)

x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))

#  samples groups
#1 A            1
#2 A            1
#3 A            2
#4 B            1
#5 B            1

Note, however, that sample_n() has been superseded by slice_sample(), and n() doesn't work inside slice_sample(). There is an open issue for this.

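As @tmfmnk mentioned, though, we don't need to call n() here: when a group has fewer than n rows, slice_sample() (with the default replace = FALSE) simply returns all of them. Try: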

df %>% group_by(samples, groups) %>% slice_sample(n = x)
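
To check that this behaves as described, here is a minimal, self-contained sketch that rebuilds the example data and counts the rows left in each group. The set.seed() call, the trailing ungroup(), and the name `result` are my additions for illustration; they are not part of the answer above.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)

x <- 2
set.seed(42)  # assumption: only here to make the random draw reproducible

result <- df %>%
  group_by(samples, groups) %>%
  slice_sample(n = x) %>%  # groups with fewer than x rows are kept whole
  ungroup()                # drop the grouping so later steps are not grouped

# Number of rows kept per (samples, groups) combination
count(result, samples, groups)
```

Every combination ends up with at most x rows, and (A, 2) keeps its single row. Which two of the three (A, 1) rows survive depends on the seed, although in this example they are identical anyway.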
