Reputation: 722
I have reached a point in my code where I iteratively calculate the p-values of some conditions:
from scipy.stats import hypergeom
pval = min(hypergeom.sf(k, M, n, N) + hypergeom.pmf(k, M, n, N), 1)
This method works for "small" values of n (the number of successful elements in the population); I tried up to 500. When I tried n = 5000, I hit a precision error: the calculated p-value is so small that it is rounded to 0. How can I overcome these precision errors in Python?
Upvotes: 0
Views: 632
Reputation: 114946
The values that you want to compute are smaller than can be represented using 64-bit floating point values. One example you give in a comment is k = 5007, M = 45956, n = 18969, N = 5267. For those values of M, n and N, the PMF underflows to 0 when the k argument is 3478:
In [46]: k = 5007
In [47]: M = 45956
In [48]: n = 18969
In [49]: N = 5267
In [50]: hypergeom.pmf(3476, M, n, N)
Out[50]: 9.8813129168249309e-324
In [51]: hypergeom.pmf(3477, M, n, N)
Out[51]: 4.9406564584124654e-324
In [52]: hypergeom.pmf(3478, M, n, N)
Out[52]: 0.0
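Note that 4.9406564584124654e-324 in the output above is the smallest positive (subnormal) 64-bit float, so the PMF at 3477 has already hit the absolute floor of double precision, and one more step underflows to exactly 0. A quick way to see that floor with NumPy:

```python
import numpy as np

# The next representable 64-bit float above 0.0: the smallest subnormal.
tiniest = np.nextafter(0.0, 1.0)
print(tiniest == 2.0**-1074)  # True: about 4.94e-324

# Any result smaller than this underflows to exactly 0.0.
print(tiniest / 2 == 0.0)     # True
```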
The standard method to work around that problem is to work with the logarithms of the probabilities. The scipy discrete distributions have the functions logpmf and logsf for this:
In [53]: hypergeom.logpmf(3476, M, n, N)
Out[53]: -743.80749253381509
In [54]: hypergeom.logpmf(3477, M, n, N)
Out[54]: -744.95722489454783
In [55]: hypergeom.logpmf(3478, M, n, N)
Out[55]: -746.10790755529888
In [56]: hypergeom.logpmf(5007, M, n, N)
Out[56]: -3952.1782915849763
To compute hypergeom.sf(k, M, n, N) + hypergeom.pmf(k, M, n, N), you can use numpy.logaddexp:
In [58]: np.logaddexp(hypergeom.logsf(k, M, n, N), hypergeom.logpmf(k, M, n, N))
Out[58]: -3952.1508002445375
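Putting the pieces together, the whole log-space computation can be sketched as follows (the parameter values are the ones from the comment quoted above):

```python
import numpy as np
from scipy.stats import hypergeom

k, M, n, N = 5007, 45956, 18969, 5267

# log(sf + pmf), computed without ever forming the underflowing probabilities.
log_pval = np.logaddexp(hypergeom.logsf(k, M, n, N),
                        hypergeom.logpmf(k, M, n, N))
print(log_pval)               # about -3952.15

# Converted to base 10, if an order of magnitude is easier to read:
print(log_pval / np.log(10))  # about -1716.4, i.e. pval ~ 4e-1717
```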
The only inconvenience is that further calculations and comparisons must be based on the logarithm of the probability. If that doesn't work for you, you'll have to switch to a library that provides higher precision floating point calculations, such as mpmath. For example, the following functions use mpmath to compute the PMF and the survival function:
def hypergeom_pmf(k, M, n, N):
    tot, good = M, n
    bad = tot - good
    pmf = (mpmath.beta(good+1, 1) * mpmath.beta(bad+1, 1) * mpmath.beta(tot-N+1, N+1) /
           (mpmath.beta(k+1, good-k+1) * mpmath.beta(N-k+1, bad-N+k+1) * mpmath.beta(tot+1, 1)))
    return pmf

def hypergeom_sf(k, M, n, N):
    sf = (mpmath.binomial(N, k+1) * mpmath.binomial(M-N, n-k-1) / mpmath.binomial(M, n) *
          mpmath.hyp3f2(1, k+1-n, k+1-N, k+2, M+k+2-n-N, 1))
    return sf
(The expressions used in hypergeom_pmf(k, M, n, N) were taken from scipy's implementation in scipy.stats.hypergeom._logpmf. hypergeom_sf uses the formula for the CDF given on the Wikipedia page on the hypergeometric distribution. It is not necessarily the best implementation of the survival function.)
For example:
In [107]: import mpmath
In [108]: mpmath.mp.dps = 40
In [109]: k, M, n, N
Out[109]: (5007, 45956, 18969, 5267)
In [110]: hypergeom_pmf(k, M, n, N)
Out[110]: mpf('3.897413335837289136238051958307757561884655e-1717')
In [111]: hypergeom_sf(k, M, n, N)
Out[111]: mpf('1.086314878026431217760059547783856962636701e-1718')
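As a consistency check, the natural log of pmf + sf computed from the two mpmath values above agrees with the numpy.logaddexp result shown earlier (the constants below are just those two printed outputs):

```python
import mpmath

mpmath.mp.dps = 40

# The two mpmath results printed above for k=5007, M=45956, n=18969, N=5267.
pmf = mpmath.mpf('3.897413335837289136238051958307757561884655e-1717')
sf = mpmath.mpf('1.086314878026431217760059547783856962636701e-1718')

# Should match np.logaddexp(logsf, logpmf) == -3952.1508...
print(mpmath.log(pmf + sf))
```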
Upvotes: 3