Reputation: 153
After using group_by
in dplyr I would like to use filter
to sample all rows in a group if there are fewer than x rows while if there are more than x rows I would like to sub-sample a specific number of rows from those groups. I'll illustrate with the diamonds data set when grouped by clarity.
diamonds %>%
group_by(clarity) %>%
summarise(count = n())
# A tibble: 8 x 2
clarity count
<ord> <int>
1 I1 741
2 SI2 9194
3 SI1 13065
4 VS2 12258
5 VS1 8171
6 VVS2 5066
7 VVS1 3655
8 IF 1790
Using this example I want to sample all rows in clarity groups if they have 5066 or fewer rows while in groups with more than 5066 rows I would like to use sample_n
without replacement to randomly sample 5000 rows. sample_n
without replacement will only work if size
is equal to or less than the number of rows in the smallest group. I'm stuck after trying a number of things but here is an example of my thought process.
diamonds %>%
group_by(clarity) %>%
if_else(n() > 5066, sample_n(size = 5000, replace = F), filter())
I'm pretty new to dplyr and still getting familiar with R in general. I'm sure this is something relatively easy but I did not see a clear solution posted. Thanks in advance!
edit:
Pretty much I would like the output from the following code but in one line of code.
# groups below or equal to 5066
low_sample_groups <- diamonds %>%
group_by(clarity) %>%
filter( n() <= 5066)
# groups above 5066
high_sample_groups <- diamonds %>%
group_by(clarity) %>%
filter( n() > 5066) %>%
sample_n(size = 5000, replace = F)
desired_result <- full_join(low_sample_groups, high_sample_groups)
edit round 2
found the answer I was looking for here: custom grouped dplyr function (sample_n)
Essentially this is the solution using an if statement
n <- 5066
desired_result <- diamonds %>%
group_by(clarity) %>%
sample_n(if(n() < n) n() else n)
Upvotes: 2
Views: 3214
Reputation: 13125
We can start by split
"group" the data frame by required variable then apply "map
" a conditional sampling based on the number of observations within each group.
diamonds %>% split(.$clarity) %>%
map(function(x) if (nrow(x) <= 5066) sample_n(size = nrow(x), replace = F,x) else sample_n(size = 5000, replace = F,x)) %>% bind_rows()
More concisely
Sample_FUN <- function(x){
if (nrow(x) <= 5000) sample_n(size = nrow(x), replace = F,x)
else sample_n(size = 5066, replace = F,x)
}
diamonds %>% split(.$clarity) %>% map(Sample_FUN) %>% bind_rows()
Upvotes: 3