glongo_fishes
glongo_fishes

Reputation: 153

dplyr filter by group conditionally with if else

After using group_by in dplyr I would like to use filter to sample all rows in a group if there are fewer than x rows while if there are more than x rows I would like to sub-sample a specific number of rows from those groups. I'll illustrate with the diamonds data set when grouped by clarity.

diamonds %>%
    group_by(clarity) %>%
    summarise(count = n())
# A tibble: 8 x 2
  clarity count
  <ord>   <int>
1 I1        741
2 SI2      9194
3 SI1     13065
4 VS2     12258
5 VS1      8171
6 VVS2     5066
7 VVS1     3655
8 IF       1790

Using this example I want to sample all rows in clarity groups if they have 5066 or fewer rows while in groups with more than 5066 rows I would like to use sample_n without replacement to randomly sample 5000 rows. sample_n without replacement will only work if size is equal to or less than the number of rows in the smallest group. I'm stuck after trying a number of things but here is an example of my thought process.

diamonds %>%
  group_by(clarity) %>%
  if_else(n() > 5066, sample_n(size = 5000, replace = F), filter())

I'm pretty new to dplyr and still getting familiar with R in general. I'm sure this is something relatively easy but I did not see a clear solution posted. Thanks in advance!

edit:

Pretty much I would like the output from the following code but in one line of code.

# groups below or equal to 5066
low_sample_groups <- diamonds %>%
  group_by(clarity) %>% 
  filter( n() <= 5066)

# groups above 5066
high_sample_groups <- diamonds %>%
  group_by(clarity) %>% 
  filter( n() > 5066) %>%
  sample_n(size = 5000, replace = F)

desired_result <- full_join(low_sample_groups, high_sample_groups)

edit round 2

found the answer I was looking for here: custom grouped dplyr function (sample_n)

Essentially this is the solution using an if statement

n <- 5066
desired_result <- diamonds %>%
  group_by(clarity) %>% 
  sample_n(if(n() < n) n() else n)

Upvotes: 2

Views: 3214

Answers (1)

A. Suliman
A. Suliman

Reputation: 13125

We can start by split "group" the data frame by required variable then apply "map" a conditional sampling based on the number of observations within each group.

diamonds %>% split(.$clarity) %>% 
             map(function(x) if (nrow(x) <= 5066) sample_n(size = nrow(x), replace = F,x) else sample_n(size = 5000, replace = F,x)) %>% bind_rows()

More concisely

Sample_FUN <- function(x){
              if (nrow(x) <= 5000) sample_n(size = nrow(x), replace = F,x) 
              else sample_n(size = 5066, replace = F,x)
              } 

diamonds %>% split(.$clarity) %>% map(Sample_FUN) %>% bind_rows()

Upvotes: 3

Related Questions