Reputation: 4636
Given myletters
:
library(tidyverse)
myletters <- letters
myletters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
I want to sample 4 letters at a time (without replacement) from myletters
, repeat this X
multiple times and find the probability of having sampled all letters at least once in X = 1:100
draws.
For example if X = 10
we could get:
set.seed(10)
X <- unlist(rerun(10, sample(myletters, 4, replace = F)))
X
# [1] "k" "i" "j" "p" "l" "w" "h" "v" "g" "s" "x" "o" "o" "j" "g" "y" "b" "x" "m" "h" "n" "g" "f" "y" "v" "r" "u" "y" "m" "e" "a" "g" "z" "r" "d" "y" "x" "s" "v"
# [40] "r"
#test if X contains all 26 letters
n_distinct(X) == 26 #26 = no of letters
#FALSE
The following approach does what I want in a simulation but doesn't scale very well as it fills a dataframe column with up to 400 letters in a cell so is awkward and inefficient:
output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output
#plot
output %>%
group_by(drawsX) %>%
summarise(prob_of_all_letters = mean(all_letters)) %>%
ggplot(., aes(drawsX, prob_of_all_letters)) +
geom_line() +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "Probability")
Ideally I would like to simulate more times e.g. trial = 1:100000
but the approach above is inefficient if I wanted to do this.
1) Is there a more efficient way to fill my dataset (or using a matrix) with samples?
2) Also, is there an analytic way to solve this problem in R instead of simulation. e.g. what is probability of get 26 letters from 10 draws of 4 samples each?
thanks
Upvotes: 0
Views: 56
Reputation: 145805
Here's a somewhat improved version. The code is a bit more efficient and certainly cleaner:
sample_sets = function(replicates, k, set = letters) {
draws = vapply(1:replicates, function(z, ...) sample.int(...), FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
all(seq_along(set) %in% draws)
}
## example use
output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
## timing
system.time({output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
})
# user system elapsed
# 2.79 0.04 2.95
## original way
system.time({output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user system elapsed
# 4.96 0.06 5.18
So it's about 40% faster on this data - hopefully that performance gain will continue as draws
increases.
Upvotes: 1