user63230
user63230

Reputation: 4636

improve efficiency of filling dataset with sampling

Given myletters:

library(tidyverse)
myletters <- letters
myletters
#  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

I want to sample 4 letters at a time (without replacement) from myletters, repeat this X multiple times and find the probability of having sampled all letters at least once in X = 1:100 draws.

For example if X = 10 we could get:

set.seed(10)
X <- unlist(rerun(10, sample(myletters, 4, replace = F)))
X
#  [1] "k" "i" "j" "p" "l" "w" "h" "v" "g" "s" "x" "o" "o" "j" "g" "y" "b" "x" "m" "h" "n" "g" "f" "y" "v" "r" "u" "y" "m" "e" "a" "g" "z" "r" "d" "y" "x" "s" "v"
# [40] "r"

#test if X contains all 26 letters
n_distinct(X) == 26 #26 = no of letters
#FALSE

The following approach does what I want in a simulation but doesn't scale very well as it fills a dataframe column with up to 400 letters in a cell so is awkward and inefficient:

output <- crossing(drawsX = 1:100,
                       trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output

#plot
output %>%
  group_by(drawsX) %>%
  summarise(prob_of_all_letters = mean(all_letters)) %>% 
  ggplot(., aes(drawsX, prob_of_all_letters)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability")

Ideally I would like to simulate more times e.g. trial = 1:100000 but the approach above is inefficient if I wanted to do this.

1) Is there a more efficient way to fill my dataset (or using a matrix) with samples?

2) Also, is there an analytic way to solve this problem in R instead of simulation. e.g. what is probability of get 26 letters from 10 draws of 4 samples each?

thanks

Upvotes: 0

Views: 56

Answers (1)

Gregor Thomas
Gregor Thomas

Reputation: 145805

Here's a somewhat improved version. The code is a bit more efficient and certainly cleaner:

sample_sets = function(replicates, k, set = letters) {
  draws = vapply(1:replicates, function(z, ...) sample.int(...), FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
  all(seq_along(set) %in% draws)
}

## example use
output <- crossing(
    drawsX = 1:100,
    trial = 1:100
  ) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )

## timing
system.time({output <- crossing(
    drawsX = 1:100,
    trial = 1:100
  ) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )
})
# user  system elapsed 
# 2.79    0.04    2.95 


## original way
system.time({output <- crossing(drawsX = 1:100,
                       trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user  system elapsed 
# 4.96    0.06    5.18 

So it's about 40% faster on this data - hopefully that performance gain will continue as draws increases.

Upvotes: 1

Related Questions