Reputation: 139
I need to plot the normal cumulative distribution function at the given bin edges:
bin_edges = np.array([1.02, 4.98, 8.93, 12.89, 16.84, 20.79, 24.75, 28.7])
mean = 15.425
standard_deviation = 6.159900567379315
First I did:
cdf = ((1 / (np.sqrt(2 * np.pi) * standard_deviation)) *
np.exp(-0.5 * (1 / standard_deviation * (bin_edges - mean))**2))
cdf = cdf.cumsum()
cdf /= cdf[-1]
Another way I found:
cdf = scipy.stats.norm.cdf(bin_edges, loc=mean, scale=standard_deviation)
The outputs of these two methods should be equal, but they are not:
First: [0.0168047 0.07815162 0.22646339 0.46391741 0.71568769 0.89247475
0.97468339 1.]
Second: [0.0096921 0.04493372 0.14591031 0.34010566 0.59087116 0.80832701
0.93495018 0.98444529]
To me it looks like the scipy cdf() result is worse. What am I doing wrong?
Upvotes: 2
Views: 1238
Reputation: 13997
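You're trying to calculate the CDF at every bin edge by calculating the value of the following integral at every bin edge:
$$\mathrm{CDF}(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt$$
where \mu is the mean and \sigma is the standard deviation.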
The reason why your result disagrees with that of scipy is that scipy is doing the integration better than you are. You're effectively integrating the normal PDF by summing over the area of the "bars" of the histogram that your bin_edges define. This won't produce a reasonably accurate result until your bin count is much, much higher (probably in the thousands at least). Your normalization approach is also off, since you really need to divide by the integral of the PDF from -inf to inf, not from 1.02 to 28.7.
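As a rough illustration of how slowly bar-summing converges, the sketch below compares a simple bar sum against scipy.stats.norm.cdf at one bin edge for several bar counts. The helper name riemann_cdf, the right-endpoint rule, and the lower cutoff of mean - 10*std are assumptions made for this example, not part of the question's code.

```python
import numpy as np
from scipy.stats import norm

mean = 15.425
std = 6.159900567379315

def riemann_cdf(x, n_bars):
    """Approximate the normal CDF at x by summing the areas of n_bars bars."""
    # Assumption: the probability mass below mean - 10*std is negligible,
    # so the sum starts there instead of at -inf.
    lower = mean - 10 * std
    edges = np.linspace(lower, x, n_bars + 1)
    width = edges[1] - edges[0]
    # Right-endpoint rule: each bar's height is the PDF at its right edge.
    heights = norm.pdf(edges[1:], loc=mean, scale=std)
    return float(np.sum(heights * width))

x = 16.84  # one of the bin edges from the question
exact = norm.cdf(x, loc=mean, scale=std)
errors = {n: abs(riemann_cdf(x, n) - exact) for n in (8, 80, 8000)}
for n, err in errors.items():
    print(f"{n:>5} bars: absolute error {err:.2e}")
```

With a first-order rule like this one, the error shrinks roughly in proportion to the bar width, which is why a handful of bins is not enough.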
On the other hand, Scipy is just calculating a high accuracy numerical approximation of a closed form solution of the integral. The function it uses is called scipy.special.ndtr. Here's its implementation in the Scipy code.
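For reference, scipy.special.ndtr can also be called directly. It evaluates the standard normal CDF, so the input has to be standardized first. This is a minimal sketch using the values from the question; the variable names z, cdf_ndtr, and cdf_norm are chosen here for illustration.

```python
import numpy as np
import scipy.special as sps
import scipy.stats as sts

bin_edges = np.array([1.02, 4.98, 8.93, 12.89, 16.84, 20.79, 24.75, 28.7])
mean = 15.425
std = 6.159900567379315

# ndtr works on the standard normal, so convert the edges to z-scores first.
z = (bin_edges - mean) / std
cdf_ndtr = sps.ndtr(z)

# The same values obtained through the higher-level distribution API.
cdf_norm = sts.norm.cdf(bin_edges, loc=mean, scale=std)

print(np.allclose(cdf_ndtr, cdf_norm))
```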
Instead of integrating by summing bar areas, you can do actual numerical integration from -inf to x in order to get a result with accuracy approaching that of scipy.stats.norm.cdf. Here's code for how to do that:
import numpy as np
import scipy.integrate as snt

def pdf(x, mean, std):
    return ((1/((2*np.pi)**.5 * std)) * np.exp(-.5*((x - mean)/std)**2))

# assumes bin_edges, mean, and std (the standard deviation) are already defined
cdf = [snt.quad(pdf, -np.inf, x, args=(mean, std))[0] for x in bin_edges]
Scipy's version of ndtr is written in C, but here's a close Python approximation for comparison purposes:
import scipy.special as sps

def ndtr(x, mean, std):
    return .5 + .5*sps.erf((x - mean)/(std * 2**.5))
import numpy as np
import scipy.special as sps
import scipy.stats as sts
import scipy.integrate as snt

# uses the pdf() and ndtr() functions defined above
bin_edges = np.array([1.02, 4.98, 8.93, 12.89, 16.84, 20.79, 24.75, 28.7])
mean = 15.425
std = 6.159900567379315

with np.printoptions(linewidth=9999):
    print(np.array([snt.quad(pdf, -np.inf, x, args=(mean, std))[0] for x in bin_edges]))
    print(ndtr(bin_edges, mean, std))
    print(sts.norm.cdf(bin_edges, loc=mean, scale=std))
Output:
[0.00968036 0.04497664 0.14584988 0.34034101 0.59084202 0.80811081 0.93496465 0.98442171]
[0.00968036 0.04497664 0.14584988 0.34034101 0.59084202 0.80811081 0.93496465 0.98442171]
[0.00968036 0.04497664 0.14584988 0.34034101 0.59084202 0.80811081 0.93496465 0.98442171]
So when you integrate accurately, the results from the method you were using match up to high precision with those of scipy.stats.norm.cdf.
Upvotes: 4