Josh

Reputation: 59

Discrepancy using Python for hypergeometric distribution

I have a question about computing a hypergeometric probability in Python. Say I have a bag of 15 balls, of which 5 are blue and 10 are red. If I randomly pick 5 balls, what is the probability that exactly 4 of the 5 are blue? Here is my simulation in Python:

import numpy as np
balls = ['blue'] * 5 + ['red'] * 10
count = 0
for i in range(10000):
    pick = np.random.choice(balls, 5)
    if list(pick).count('blue') == 4:
        count += 1
odds = count / 10000
print(odds)

I get around 0.04. But if I use scipy.stats, I get a different number. The code is very simple.

from scipy import stats
odds=stats.hypergeom.pmf(4, 15, 5, 5)
print(odds)

I get 0.016. So why are these two different?

Upvotes: 1

Views: 301

Answers (1)

Warren Weckesser

Reputation: 114811

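To match the hypergeometric distribution, you must pick the 5 balls without replacement. By default, np.random.choice samples with replacement, so every draw is blue with probability 5/15 regardless of the earlier draws, and the number of blues follows a binomial distribution instead. Change the generation of pick to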

pick = np.random.choice(balls, 5, replace=False)
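
As a rough check of the two numbers in the question, here is a minimal sketch that uses only the Python standard library instead of NumPy and SciPy. It assumes that random.sample (without replacement) and random.choices (with replacement) are fair stand-ins for np.random.choice with replace=False and replace=True. The helper name estimate, the seed, and the trial count are illustrative choices, not part of the original code.

```python
import random
from math import comb

# Exact probability without replacement (hypergeometric):
# choose 4 of the 5 blue balls and 1 of the 10 red balls,
# out of all ways to draw 5 balls from 15.
p_without = comb(5, 4) * comb(10, 1) / comb(15, 5)   # 50/3003, roughly 0.0167

# Exact probability with replacement (binomial with p = 5/15):
# 4 blue draws and 1 red draw, in any of comb(5, 4) orders.
p_with = comb(5, 4) * (5 / 15) ** 4 * (10 / 15)      # 10/243, roughly 0.0412


def estimate(draw, trials=100_000, seed=0):
    """Estimate P(exactly 4 blue in 5 draws) for a given drawing function."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balls = ['blue'] * 5 + ['red'] * 10
    hits = sum(draw(rng, balls).count('blue') == 4 for _ in range(trials))
    return hits / trials


# Without replacement: comparable to np.random.choice(balls, 5, replace=False)
sim_without = estimate(lambda rng, balls: rng.sample(balls, 5))

# With replacement: comparable to the default np.random.choice(balls, 5)
sim_with = estimate(lambda rng, balls: rng.choices(balls, k=5))

print(p_without, sim_without)
print(p_with, sim_with)
```

The first pair should land close to the 0.016 reported by stats.hypergeom.pmf(4, 15, 5, 5), and the second pair close to the 0.04 from the original simulation.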

Upvotes: 1
