Reputation: 4933
In R, I use the phyper
function to do a hypergeometric test for bioinformatics analysis. However I use a lot of Python code and using rpy2 here is quite slow. So, I started looking for alternatives. It seemed that scipy.stats.hypergeom
had something similar.
Currently, I call phyper
like this:
pvalue <- 1-phyper(45, 92, 7518, 1329)
where 45 is the number of selected items having the property of interest, 92 the number of total items having the property, 7518 the number of non selected items not having the property, and 1329 the total number of selected items.
In R, this yields 6.92113e-13
.
Attempting to do the same with scipy.stats.hypergeom
however yields a completely different result (notice, the numbers are swapped because the function accepts numbers in a different way):
import scipy.stats as stats
pvalue = 1-stats.hypergeom.cdf(45, 7518, 92. 1329)
print pvalue
However this returns -7.3450134863151106e-12, which makes little sense. Notice that I've tested this on other data and I had little issues (same precision up to the 4th decimal, which is enough for me).
So it boils down to these possibilities:
In case of "1", are there other alternatives to phyper
that can be used in Python?
EDIT: As the comments have noted, this is a bug in scipy, fixed in git master.
Upvotes: 6
Views: 3872
Reputation: 66844
From the docs, you could try:
hypergeom.sf(x,M,n,N,loc=0)
: survival function (1-cdf — sometimes more accurate)
Also, I think you might have the values mixed up.
Models drawing objects from a bin. M is total number of objects, n is total number of Type I objects. RV counts number of Type I objects in N drawn without replacement from population.
Therefore, by my reading: x=q
, M=n+m
, n=m
, N=k
.
So I would try:
stats.hypergeom.sf(45,(92+7518),92,1329)
Upvotes: 10