About how numpy.random.choice works

Question

I'm trying to writing a program that simulate the Coupon collector's problem. Here is quick reference for the problem:

Given n coupons, how many coupons do you expect you need to draw with replacement before having drawn each coupon at least once.

I come up with two pieces of code:

(1)

n = 50
coupon = np.arange(0, n)
def collect(coupon):
    number = 49
    collection = np.array([])
    while len(set(collection)) != n:
          number +=1
          collection = np.random.choice(coupon, replace = True, size = number)
    return number

And got the result for 10 iterations of collect(coupon) with average value is :

[175. 151. 128. 132. 169. 118. 134. 138. 150. 135.]
143.0

(2)

n = 50
coupon = np.arange(0,n)
def collect(coupon):
    collection = set()
    number = 0
    while len(collection) != n:
          number +=1
          got = np.random.choice(coupon)
          collection.add(got)
    return number

The result for running 10 iteration of collect(coupon) with average value is:

[184, 119, 286, 196, 172, 370, 163, 267, 238, 199]
219.4

I have tried for large number of interations, Code (1) and code (2) produce very different results.

I know that the correct answer for expected value to collect all 50 coupons is 225 and code (2) is the right one. On the other hand, I can't find a reasonable explanation for why code (1) fails ? Why numpy.random.choice is'nt work in this example ?

tel · Accepted Answer

Numpy problems (or not)

Your code seems fine, including your use of np.random.choice. One possible spot of confusion you might be having relates to the default value of the replace parameter. replace defaults to True, so you don't need to explicitly pass replace = True to choice in code block (1).

Aside from that very minor issue, there's no glaring problems with your code. Thus, the trouble may be in the math and probability instead.

Probability issues

225 is the expected value when the draw size is 1. In your code (1), the draw size increases with every iteration. Think about it this way: as the draw size increases, the odds of getting all of the coupons in a single draw start becoming reasonably large. I don't know the exact number (EDIT: I found some exact numbers. They're now in the "Deeper investigation" section below), but say that the probability of getting all 50 coupons in a single draw of 100 is .01. By the time you get to draw 143, the cumulative odds of having gotten all of the coupons at least once should be at least .4 (and likely greater).

Deeper investigation

With a few tweaks, your code can be used to estimate the odds of seeing all 50 coupons in a single draw of size x:

def collectx(coupon, x, reps):
    x = np.asarray(x)
    n = coupon.size

    counts = np.zeros((x.size, reps), dtype=int)
    for i,xsub in enumerate(x):
        for j in range(reps):
            count = 1
            while np.unique(np.random.choice(coupon, size=xsub)).size < n:
                count += 1
            counts[i, j] = count

    return counts

The probabilities for many different x values can be estimated at once by passing in a sequence. Now, to estimate the probabilities for all draw sizes 120-143 you can run:

n = 50
coupon = np.arange(0, n)
counts = collectx(coupon, np.arange(120,144), 100)

This yields counts, which is an array of shape (24, 100). The draw size varies over the rows, and the replicate id of the estimate varies over the columns. You can get the "see every coupon in a single draw" probabilities by taking the mean across the estimate replicates and dividing the result into 1:

probx = (1/counts.mean(axis=1))

which looks like:

[0.00418971 0.00563    0.00661288 0.00694493 0.00690799 0.00854774
 0.00909339 0.01050531 0.01207875 0.01344086 0.01485222 0.0155642
 0.02004008 0.02115059 0.02015723 0.02377556 0.02639916 0.02379819
 0.02856327 0.03941663 0.04145937 0.03162555 0.03601008 0.04821601]

For a draw size of 120, the probability is still below .005, but it increases rapidly, and at a draw size of 143 the probability of seeing every coupon is nearly .05. Since the probabilities for draw sizes below 120 are small, summing over the probabilities of draw sizes 120-143 gives a reasonable estimate of the cumulative probability of your code block (1) having seen all coupons in a single draw by the time its finished with draw size 143:

print('%.3f' % probx.sum())

Output:

0.475

So your code block (1) is as likely as not to have seen every coupon when n==143. Which is completely consistent with the results you observed. So no Numpy problem, just a probability one.

About how numpy.random.choice works

Answers (1)

Numpy problems (or not)

Probability issues

Deeper investigation

Related Questions