Yang Wang

Reputation: 335

Why is norm.cdf faster than norm.pdf in scipy?

I'm using scipy for some norm.pdf and norm.cdf calculations, and I'm wondering why the cdf is faster than the pdf.

I know there are asymptotic approximations for norm.cdf, while it seems that in scipy the cdf is obtained by integrating norm.pdf. That's why I can't see how cdf could be faster than pdf. If integration is used, cdf should be much slower than pdf (maybe parallel computing helps a lot?); if an asymptotic approximation is used, I'd still expect cdf to be a little slower than pdf.

Below shows some simple samples:

import scipy.stats as st
from datetime import datetime
import numpy as np
num_iter = 100000
x_lower = 0.25
x_upper = 0.75

time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
    y = st.norm.pdf(x)
time_end = datetime.now()
print(time_end - time_start)

time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
    y = st.norm.cdf(x)
time_end = datetime.now()
print(time_end - time_start)

And here are the running results:

0:00:05.736985
0:00:04.896390

Upvotes: 3

Views: 1952

Answers (1)

dapetillo

Reputation: 170

A quick look at the source code shows that scipy.stats.norm.pdf simply evaluates the pdf at x using NumPy:

def _norm_pdf(x):
    return np.exp(-x**2/2.0) / _norm_pdf_C

where _norm_pdf_C = np.sqrt(2*np.pi).

For the cdf, since we are dealing with a normal distribution, special functions are used (for the relation between them and the normal distribution, see here).
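As a quick check of that relation (my addition, not part of the original answer): the normal cdf can be written in terms of the error function, Phi(x) = (1 + erf(x / sqrt(2))) / 2, and SciPy's norm.cdf agrees with that formula:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

# The standard normal CDF expressed via the error function:
# Phi(x) = (1 + erf(x / sqrt(2))) / 2
x = 0.5
phi_via_erf = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

print(np.isclose(phi_via_erf, norm.cdf(x)))  # True
```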

SciPy implements these special functions directly in C; in particular, the cumulative distribution function is computed in ndtr.c. So even though NumPy is really fast, the C implementation is still faster in this case, I assume.
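That C routine is exposed as scipy.special.ndtr, so you can call it directly and verify it matches norm.cdf (a small sketch of mine; norm.cdf adds argument checking and loc/scale handling on top of it):

```python
import numpy as np
from scipy.special import ndtr  # thin wrapper over the C routine in ndtr.c
from scipy.stats import norm

x = np.array([0.25, 0.5, 0.75])

# norm.cdf delegates the actual math to ndtr, so the values coincide.
print(np.allclose(ndtr(x), norm.cdf(x)))  # True
```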

EDIT

Sorry, I just realized my answer does not completely answer your question.

First of all, NumPy also implements its mathematical operations in C. Therefore, to understand why the times differ, one has to understand what happens in C.

  • If you look at this question, it seems that numerical values and hardware architecture can affect the time.

So I checked the C implementation of the cdf again, and I saw that the constants and the coefficients of the polynomials that evaluate the special functions are not computed at call time but stored in arrays and variables. For example, 1/sqrt(2) is stored in NPY_SQRT1_2. This could be why cdf is faster than pdf!

So I tried computing the pdf with the constant already initialized:

import scipy.stats as st
from datetime import datetime
import numpy as np
num_iter = 100000
x_lower = 0.25
x_upper = 0.75

const = np.sqrt(2*np.pi)
time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
    # y = st.norm.pdf(x)
    y = np.exp(-x**2 / 2) / const
time_end = datetime.now()
print(time_end - time_start)

time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
    y = st.norm.cdf(x)
time_end = datetime.now()
print(time_end - time_start)
This code gives me:

0:00:00.202531
0:00:07.703083

Note that norm.pdf also pre-initializes the denominator of the pdf, but in your loop you call the method on every iteration, which slows things down.

P.S.: if you get rid of the loops in your original code and simply use x = np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)), cdf is again faster. The reason may be that the cdf is computed with polynomial approximations, but I did not find information on how exactly C handles the exponential, so I cannot make a direct comparison.
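For reference, the vectorized version looks like this (a sketch of mine, not from the original answer): one call on the whole array replaces 100000 Python-level method calls, so the per-call dispatch overhead disappears and only the C-level math is timed.

```python
import numpy as np
import scipy.stats as st

num_iter = 100000
x_lower, x_upper = 0.25, 0.75

# One array, one call per distribution function -- no Python loop.
x = np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1))
pdf_vals = st.norm.pdf(x)
cdf_vals = st.norm.cdf(x)

print(pdf_vals.shape == cdf_vals.shape)  # True
```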

Upvotes: 3
