George Zorikov

Reputation: 139

Theoretical normal distribution function in scipy

I need to plot the normal cumulative distribution for the given bin edges:

bin_edges = np.array([1.02,  4.98,  8.93, 12.89, 16.84, 20.79, 24.75, 28.7])
mean = 15.425
standard_deviation = 6.159900567379315

First I did:

cdf = ((1 / (np.sqrt(2 * np.pi) * standard_deviation)) *
   np.exp(-0.5 * (1 / standard_deviation * (bin_edges - mean))**2))
cdf = cdf.cumsum()
cdf /= cdf[-1]

Another way I found:

cdf = scipy.stats.norm.cdf(bin_edges, loc=mean, scale=standard_deviation)

The outputs of these two methods should be equal, but they are not:

First: [0.0168047  0.07815162 0.22646339 0.46391741 0.71568769 0.89247475 
0.97468339 1.]
Second: [0.0096921  0.04493372 0.14591031 0.34010566 0.59087116 0.80832701
0.93495018 0.98444529]

To me it looks like the scipy cdf() result is worse. What am I doing wrong?

Upvotes: 2

Views: 1238

Answers (1)

tel

Reputation: 13997

The problem

You're trying to calculate the CDF at every bin edge by calculating the value of the following integral at every bin edge:

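$$F(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.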

The reason your result disagrees with scipy's is that scipy does the integration better than you do. You're in effect integrating the normal PDF by summing the "bars" of the histogram that your bin_edges define. Because you sum raw PDF values without multiplying by the bin width, the scale is only recovered by the normalization afterwards. This won't produce a reasonably accurate result until your bin count is much, much higher (probably in the thousands at least). Your normalization is also off: dividing by cdf[-1] forces the value at 28.7 to be exactly 1, whereas you really need to divide by the integral of the PDF from -inf to inf, not from 1.02 to 28.7.
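To get a feel for how the bar-summing approach behaves as the bin count grows, here is a minimal sketch using only the standard library. The helper names (normal_pdf, normal_cdf_exact, riemann_cdf) and the lower_sigmas parameter are mine, not from the question; the lower limit of -inf is approximated by mean - 10*std, which is an assumption that only works because the PDF is negligible that far out.

```python
import math

def normal_pdf(x, mean, std):
    # Normal probability density at x
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def normal_cdf_exact(x, mean, std):
    # Reference value via the error function
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

def riemann_cdf(x, mean, std, n_bins, lower_sigmas=10.0):
    # Midpoint Riemann sum of the PDF from (mean - lower_sigmas*std) up to x.
    # Each bar's area is height * width, so no renormalization is needed.
    lo = mean - lower_sigmas * std
    width = (x - lo) / n_bins
    heights = (normal_pdf(lo + (i + 0.5) * width, mean, std) for i in range(n_bins))
    return sum(heights) * width

mean, std = 15.425, 6.159900567379315
x = 12.89  # one of the question's bin edges

# Absolute error of the bar sum for a few bin counts
errors = {
    n: abs(riemann_cdf(x, mean, std, n) - normal_cdf_exact(x, mean, std))
    for n in (8, 80, 8000)
}
for n, err in errors.items():
    print(n, err)
```

The error shrinks as the bars get narrower, which is the sense in which a handful of bins is not enough.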

On the other hand, Scipy is just calculating a high-accuracy numerical approximation of a closed-form expression for the integral (in terms of the error function). The function it uses is called scipy.special.ndtr. Here's its implementation in the Scipy code.

The solution

Instead of integrating by summing bar areas, you can do actual numerical integration from -inf to x in order to get a result with accuracy approaching that of scipy.stats.norm.cdf. Here's code for how to do that:

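import numpy as np
import scipy.integrate as snt

def pdf(x, mean, std):
    # normal probability density at x
    return ((1/((2*np.pi)**.5 * std)) * np.exp(-.5*((x - mean)/std)**2))

# std is the question's standard_deviation; quad returns (value, error estimate)
cdf = [snt.quad(pdf, -np.inf, x, args=(mean, std))[0] for x in bin_edges]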
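Note that quad returns a pair: the integral estimate and an estimate of the absolute error, which is why the list comprehension takes element [0]. A minimal sketch for inspecting that error estimate at a single bin edge (the names value and abserr are mine, chosen for illustration):

```python
import numpy as np
import scipy.integrate as snt

def pdf(x, mean, std):
    # Normal probability density at x
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

mean = 15.425
std = 6.159900567379315

# quad returns (integral estimate, estimated absolute error)
value, abserr = snt.quad(pdf, -np.inf, 12.89, args=(mean, std))
print(value, abserr)
```

If abserr ever comes back large, that is a hint the integrand is hard for quad (for example, a very narrow peak far from the origin), and the result should not be trusted blindly.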

Scipy's version of ndtr is written in C, but here's a close Python approximation for comparison purposes:

import scipy.special as sps

def ndtr(x, mean, std):
    return .5 + .5*sps.erf((x - mean)/(std * 2**.5))
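
For reference, the erf expression used in this approximation follows from a change of variables in the CDF integral. A short derivation sketch, writing $\mu$ for mean and $\sigma$ for std (my notation, not from the Scipy source):

```latex
\begin{aligned}
F(x) &= \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}
        \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt
     && \text{substitute } u = \frac{t-\mu}{\sigma\sqrt{2}},\; dt = \sigma\sqrt{2}\,du \\
     &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{z} e^{-u^2}\,du,
     && z = \frac{x-\mu}{\sigma\sqrt{2}} \\
     &= \frac{1}{\sqrt{\pi}} \left( \frac{\sqrt{\pi}}{2}
        + \frac{\sqrt{\pi}}{2}\,\operatorname{erf}(z) \right)
     && \text{since } \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-u^2}\,du \\
     &= \frac{1}{2} + \frac{1}{2}\,\operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)
\end{aligned}
```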

Testing it out

import numpy as np
import scipy.special as sps
import scipy.stats as sts
import scipy.integrate as snt

bin_edges = np.array([1.02,  4.98,  8.93, 12.89, 16.84, 20.79, 24.75, 28.7])
mean = 15.425
std = 6.159900567379315

with np.printoptions(linewidth=9999):
    print(np.array([snt.quad(pdf, -np.inf, x, args=(mean, std))[0] for x in bin_edges]))
    print(ndtr(bin_edges, mean, std))
    print(sts.norm.cdf(bin_edges, loc=mean, scale=std))

Output:

[0.00968036 0.04497664 0.14584988 0.34034101 0.59084202 0.80811081 0.93496465 0.98442171]
[0.00968036 0.04497664 0.14584988 0.34034101 0.59084202 0.80811081 0.93496465 0.98442171]
[0.00968036 0.04497664 0.14584988 0.34034101 0.59084202 0.80811081 0.93496465 0.98442171]

So when you integrate accurately, the results from the method you were using match up to high precision with those of scipy.stats.norm.cdf.
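
For an independent check that needs neither numpy nor scipy, Python's standard library (3.8+) provides statistics.NormalDist, whose cdf method evaluates the same function. A minimal sketch reusing the numbers from the test above (the variable names dist and stdlib_cdf are mine):

```python
from statistics import NormalDist

bin_edges = [1.02, 4.98, 8.93, 12.89, 16.84, 20.79, 24.75, 28.7]
mean = 15.425
std = 6.159900567379315

# NormalDist takes the mean as mu and the standard deviation as sigma
dist = NormalDist(mu=mean, sigma=std)
stdlib_cdf = [dist.cdf(x) for x in bin_edges]
print([round(v, 8) for v in stdlib_cdf])
```

This should agree with the three rows printed above to the displayed precision.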

Upvotes: 4
