Reputation: 762
I am trying to draw n random rows from each group of a grouping variable, where n varies by group. For example:
library(dplyr)
iris <- iris %>% mutate(len_bin = cut(Sepal.Length, seq(0, 8, by = 1)))
I have these factors, which are my grouped variable:
table(iris$len_bin)
(4,5] (5,6] (6,7] (7,8]
32 57 49 12
Is there a way to sample only from these groups, drawing n rows from each, where n is the number of times the group appears in this vector:
x <- c("(4,5]","(5,6]","(5,6]","(5,6]","(6,7]")
The result should look like:
# Groups: len_bin [4]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species len_bin
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 5 2 3.5 1 versicolor (4,5]
2 5.3 3.7 1.5 0.2 setosa (5,6]
2 5.3 3.7 1.5 0.2 setosa (5,6]
2 5.3 3.7 1.5 0.2 setosa (5,6]
3 6.5 3 5.8 2.2 virginica (6,7]
I managed to do this with a for loop that calls sample_n() for each element of the vector, but I assume there must be a faster way. Can I define n per group within sample_n(), for example?
Upvotes: 1
Views: 102
Reputation: 51994
In base R you can do:
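# Binning step as in the question (this one line uses dplyr, loaded there)
iris <- iris %>% mutate(len_bin = cut(Sepal.Length, seq(4, 8, by = 1)))
x <- c("(4,5]","(5,6]","(5,6]","(5,6]","(6,7]")
# Pair each group's rows with its count in x (0 for groups absent from x)
# and draw that many rows without replacement
l <- mapply(\(df, n) df[sample(nrow(df), n), ],
            split(iris, iris$len_bin),
            c(table(factor(x, levels = levels(iris$len_bin)))),
            SIMPLIFY = FALSE)
# Stack the per-group samples into one data frame
do.call(rbind.data.frame, l)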
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species len_bin
#(4,5] 5.0 3.2 1.2 0.2 setosa (4,5]
#(5,6].17 5.4 3.9 1.3 0.4 setosa (5,6]
#(5,6].63 6.0 2.2 4.0 1.0 versicolor (5,6]
#(5,6].97 5.7 2.9 4.2 1.3 versicolor (5,6]
#(6,7] 6.9 3.1 5.1 2.3 virginica (6,7]
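If you would rather stay in dplyr, as the question hints at with sample_n(), here is a minimal sketch of the same idea using group_modify() and slice_sample(). It assumes dplyr 1.0.0 or later; the names iris_binned, n_per_bin and sampled are only illustrative, and set.seed() is there just to make the draw repeatable.

```r
library(dplyr)

# Bin Sepal.Length as in the answer above (illustrative object name)
iris_binned <- iris %>% mutate(len_bin = cut(Sepal.Length, seq(4, 8, by = 1)))
x <- c("(4,5]", "(5,6]", "(5,6]", "(5,6]", "(6,7]")

# Named integer vector: how many rows to draw from each bin
counts <- table(x)
n_per_bin <- setNames(as.integer(counts), names(counts))

set.seed(1)  # only so that the sketch is reproducible
sampled <- iris_binned %>%
  filter(len_bin %in% names(n_per_bin)) %>%  # drop bins that are not requested
  group_by(len_bin) %>%
  group_modify(function(rows, key) {
    # key is a one-row tibble holding the current bin label
    size <- n_per_bin[[as.character(key$len_bin)]]
    slice_sample(rows, n = size)             # sample without replacement
  }) %>%
  ungroup()

sampled
```

Bins that do not appear in x are filtered out before grouping, so no zero-size draw is needed, and len_bin ends up as the first column of the result.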
Upvotes: 1