thothal
thothal

Reputation: 20379

sample(.) versus calculated probabilities

Where is my mistake in the following code/reasoning.

If I have n items and I want to count the number of combinations (i.e. order does not matter) of draws of size k with replacement I can use the binomial coefficient:

choose(n + k - 1, k)

When I want to count the number of possibilities where element i is not part of the draw I would thus use

choose(n + k - 1 - 1, k)

(Basically I have one option less to draw from).

E.g. with n=3, k= 3 I have the following 10 (choose(3 + 3 - 1, 3)) draws:

# AAA, AAB, ABB, BBB, BBC, BCC, CCC, CCA, CAA, ABC 

The ones where element 'A' (say) does not appear, are the following 4 (choose(3 + 3 - 1 - 1, 3)):

# BBB, BBC, BCC, CCC

So far so good. I can thus calculate the probability that I have a draw with n = k where element i is not appearing:

freeFromCounts <- function(n) choose(2 * n - 2, n)
totalCounts    <- function(n) choose(2 * n - 1, n)
ratio          <- function(n) freeFromCounts(n) / totalCounts(n) ## (n - 1) / (2 * n - 1)

So here is my problem if I simulate the draws (with sample(.)) and repeat that n.rep times, I would expect to see about n.rep * ratio(n) draws where element 1 (say) is not present. This is, however, not the case. Where is my mistake?

sim <- function(n, n.rep = 10000, x0 = 1) {
    sum(replicate(n.rep, {
       s <- sample(n, n, TRUE)
       all(s != x0)
    })) / n.rep
}

set.seed(12)
sim(10, 1e6)  # [1] 0.348278
ratio(10)     # [1] 0.4736842

Upvotes: 2

Views: 64

Answers (1)

thothal
thothal

Reputation: 20379

A colleague found the solution. sample samples not unordered but ordered, i.e. it make a difference between [1, 2, 3] and [3, 2, 1]. With this knowledge the figures are the same: 9 ^ 10 / 10 ^ 10 = 0.3486784.

So to do the simulation for the original problem, you have to come up with a variant of sample which does unordered sampling. Idea is following the famous derivation of the formula for unordered sampling with replacement (cf. e.g. http://mathworld.wolfram.com/Multichoose.html)

sampleUnordered <- function(n, k) {
  aS <- rep(0, n + k -1)
  aS[!seq_along(aS) %in% sample(n + k - 1, k)] <- seq.int(n-1)
  rl <- rle(aS)
  zeros <- which(rl$values == 0)
  good <- rl$lengths[zeros]
  nr <- c(rl$values, n)[zeros + 1]
  rep(nr, good)
}

sim <- function(n, n.rep = 10000, x0 = 1) {
    sum(replicate(n.rep, {
       s <- sampleUnordered(n, n)
       all(s != x0)
    })) / n.rep
}

set.seed(12)
sim(10, 1e6) # [1] 0.473234

Upvotes: 1

Related Questions