guygrinberger

Reputation: 327

Discrete approximation to a Gaussian distribution using numpy

I'm trying to get a discrete approximation to a Gaussian distribution for n >= 2.

So, for example, if n = 2 then the discrete approximation would be [0.5, 0.5].

When n = 3 it would be [0.25, 0.5, 0.25].

When n = 4 it would be [0.125, 0.375, 0.375, 0.125].

You get my point, I hope.

The returned discrete approximation array should always sum to 1, as with any probability distribution.

Here is my code:

import numpy as np

def discrete_gauss(n):
    # Start from the n = 2 case and repeatedly convolve with [0.5, 0.5];
    # each convolution grows the array by one element, so len(f) == n.
    g = [0.5, 0.5]
    f = g
    for x in range(1, n - 1):
        f = np.convolve(f, g)

    if sum(f) != 1:
        print("The distribution sum is not 1.")
    else:
        return f

Now 'discrete_gauss' works great when I use 1 < n < 68, but when I enter n > 67 the sum of f is different from 1 (sometimes more, sometimes less) and I don't know why. Does anyone have a clue?
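To make this concrete, here is a minimal check of the raw convolution sum for n = 68 (the exact deviation varies with n):

import numpy as np

g = [0.5, 0.5]
f = g
for x in range(1, 68 - 1):  # same loop as in discrete_gauss(68)
    f = np.convolve(f, g)

# The accumulated float64 rounding error makes the sum come out very
# close to, but not exactly, 1.0 once n > 67.
print(sum(f) == 1, sum(f))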

Sorry for the messy question; I tried to keep it short. I'll be happy to clarify anything that's unclear. Thanks.

Upvotes: 1

Views: 1319

Answers (1)

tel

Reputation: 13999

tl;dr

Read this paper about the challenges of using floating point math, and then reconsider your approach.

Solution

Here's an alternative procedure for generating your desired "distribution" that avoids the floating point rounding errors in the summation that np.convolve performs:

import numpy as np
import scipy.special as sps

def discrete_gauss(n):
    # Build the nth level of Pascal's triangle with exact (arbitrary
    # precision) Python ints, stored in an object array.
    f = np.array([sps.comb(n - 1, i, exact=True) for i in range(n)], dtype='O')
    # Convert to floating point only for the final normalization step.
    f = np.float64(f)/np.float64(f).sum()

    if not np.allclose(f.sum(), 1.0):
        raise ValueError("The distribution sum is not close to 1.\n"
                         "f.sum(): %s" % f.sum())

    return f
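For example, for n = 4 this reproduces the sequence from the question (these values have power-of-two denominators, so they happen to be exactly representable in float64):

>>> discrete_gauss(4)
array([0.125, 0.375, 0.375, 0.125])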

Explanation of solution

The sequence you want is equivalent to the nth level of Pascal's triangle (see the figure at the top of the Wikipedia article on the binomial theorem), normalized so that it can represent a probability. The above solution uses standard Python int values (which are arbitrary precision by default in Python 3) to find the values in the nth level, then switches to floating point math only at the very end, for the normalization step (i.e. np.float64(f)/np.float64(f).sum()).
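As a concrete sketch of that two-phase approach (exact integer arithmetic first, floats only for the final division), using n = 5:

import scipy.special as sps

# 5th level of Pascal's triangle, computed with exact Python ints
row = [sps.comb(4, i, exact=True) for i in range(5)]
print(row)       # [1, 4, 6, 4, 1]
print(sum(row))  # 16 -- still exact; no rounding has happened yet
# Only now switch to floating point, for the normalization
print([c / 16 for c in row])  # [0.0625, 0.25, 0.375, 0.25, 0.0625]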

Note the use of not np.allclose(f.sum(), 1.0) in the check above, instead of f.sum() != 1.0. As discussed below in the Deeper dive section, f.sum() will be equal to 1.0 for ~90% of the values of n from 1-1000. However, in general you cannot assume that the result of a floating point computation will exactly match the result you'd get from an equivalent computation using real numbers (see this paper for all of the gory details). When dealing with floats, you should usually (by which I mean almost always) check that a result is close (i.e. equal to within a given tolerance/error) to your expected value, rather than exactly equal to it.
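The standard cautionary example shows why (nothing specific to this problem):

>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2
0.30000000000000004
>>> import numpy as np
>>> np.isclose(0.1 + 0.2, 0.3)
True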

Deeper dive

This solution isn't perfect. Most values of n produce results that exactly sum to 1.0, but some don't. The following code checks the results of discrete_gauss(n) for values of n from 1-1000:

# Collect every n in 1-1000 for which the sum is not exactly 1.0
nnot1 = []
for n in range(1, 1001):
    if discrete_gauss(n).sum() != 1.0:
        nnot1.append(n)

print('discrete_gauss(n).sum() was not equal to 1.0 for %d values of n.' % len(nnot1))
print(nnot1)

Output:

discrete_gauss(n).sum() was not equal to 1.0 for 75 values of n.
[78, 89, 110, 114, 125, 127, 180, 182, 201, 206, 235, 248, 273, 342, 346, 348, 365, 373, 383, 390, 402, 403, 421, 427, 429, 451, 454, 471, 502, 531, 540, 556, 558, 574, 579, 584, 587, 595, 600, 609, 617, 631, 633, 647, 648, 651, 657, 669, 674, 703, 705, 728, 731, 763, 765, 772, 778, 783, 798, 816, 837, 852, 858, 860, 861, 867, 874, 877, 906, 912, 941, 947, 959, 964, 972]

Thus, for ~8% of these values, discrete_gauss(n).sum() was not exactly equal to 1.0. However, since no error was raised, np.allclose(discrete_gauss(n).sum(), 1.0) was always True.

Notes

  • scipy.special.comb(n, k, exact=True) gives the (n, k)th binomial coefficient as a standard Python int, which is equivalent to the kth value in the nth level of Pascal's triangle, as the example below shows.
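A quick comparison of the two modes (exact=False is the default and returns a float):

>>> import scipy.special as sps
>>> sps.comb(4, 2, exact=True)   # exact Python int
6
>>> sps.comb(4, 2)               # default: floating point
6.0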

Upvotes: 2
