Reputation: 335
I'm now using scipy for some norm.pdf and norm.cdf calculations, and I'm wondering: why is cdf faster than pdf?
I know that there are asymptotic approaches for norm.cdf, while it seems that in scipy the cdf would have to be obtained by integrating norm.pdf. That's why I cannot imagine cdf being faster than pdf: if integration is used, cdf should be much slower than pdf (maybe parallel computing could help a lot?), and if an asymptotic approach is applied, I would still expect cdf to be a little slower than pdf.
Below are some simple samples:
import scipy.stats as st
from datetime import datetime
import numpy as np
num_iter = 100000
x_lower = 0.25
x_upper = 0.75
time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
y = st.norm.pdf(x)
time_end = datetime.now()
print(time_end - time_start)
time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
y = st.norm.cdf(x)
time_end = datetime.now()
print(time_end - time_start)
And here are the running results:
0:00:05.736985
0:00:04.896390
Upvotes: 3
Views: 1952
Reputation: 170
A quick look in the source code shows that scipy.stats.norm.pdf simply evaluates the pdf at x using NumPy:
def _norm_pdf(x):
    return np.exp(-x**2/2.0) / _norm_pdf_C
where _norm_pdf_C = np.sqrt(2*np.pi).
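As a quick sanity check (my own snippet, not part of the SciPy sources), evaluating that formula by hand agrees with st.norm.pdf:
import numpy as np
import scipy.stats as st

x = 0.5
print(np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))  # ~0.3520653
print(st.norm.pdf(x))                          # same value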
For the cdf, since we are talking about a normal distribution, special functions are used (for the relation between them and the normal distribution, see here). SciPy implements special functions directly in C; in particular, the cumulative distribution function is computed by ndtr.c. So even though NumPy is really fast, C is still faster in this case, I assume.
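As an illustration (my own snippet; ndtr and erfc both live in scipy.special), the C routine can be called directly, and it agrees with the textbook relation Phi(x) = erfc(-x / sqrt(2)) / 2:
import numpy as np
import scipy.stats as st
from scipy.special import ndtr, erfc

x = 0.5
print(st.norm.cdf(x))               # ~0.6914625
print(ndtr(x))                      # same value, straight from the C routine
print(0.5 * erfc(-x / np.sqrt(2)))  # same value, via the erfc relation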
EDIT
Sorry, I just realized my answer does not completely answer your question.
First of all, NumPy also implements mathematical operations in C. Therefore, to understand the difference in timings, one should look at what happens in C.
So I checked the C implementation of the cdf again, and I saw that the constants and the coefficients of the polynomials that evaluate the special functions are not computed at runtime but stored in arrays and variables! For example, 1/sqrt(2) is stored in NPY_SQRT1_2. This could be the reason why cdf is faster than pdf!
So I tried to calculate the pdf with the constant already initialized:
import scipy.stats as st
from datetime import datetime
import numpy as np
num_iter = 100000
x_lower = 0.25
x_upper = 0.75
const = np.sqrt(2*np.pi)
time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
# y = st.norm.pdf(x)
    y = np.exp(-x**2 / 2) / const  # inlined pdf formula (note the minus sign)
time_end = datetime.now()
print(time_end - time_start)
time_start = datetime.now()
for x in np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)):
y = st.norm.cdf(x)
time_end = datetime.now()
print(time_end - time_start)
This code gives me:
0:00:00.202531
0:00:07.703083
Notice that norm.pdf also pre-initializes the denominator of the pdf, but in the for loop you are calling the method every time, and that Python-level call overhead slows things down.
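To make that call overhead visible, one can time a single evaluation both ways (a rough sketch of my own; absolute numbers depend on the machine):
import timeit
import numpy as np
import scipy.stats as st

const = np.sqrt(2 * np.pi)

# per-call cost of the full method (argument checking, dispatch, ...)
print(timeit.timeit(lambda: st.norm.pdf(0.5), number=10000))
# per-call cost of the bare formula
print(timeit.timeit(lambda: np.exp(-0.5**2 / 2) / const, number=10000))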
P.S.: if you get rid of the loops in your original code and simply use x = np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1)), cdf is again faster. The reason may be that cdf is computed with polynomial approximations, but I did not find information on how exactly C handles the exponential, so I cannot make a direct comparison.
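For completeness, the vectorized version would look like this (same parameters as in the question; I have not included timings here, since they vary by machine):
import scipy.stats as st
from datetime import datetime
import numpy as np

num_iter = 100000
x_lower = 0.25
x_upper = 0.75
x = np.arange(x_lower, x_upper, (x_upper - x_lower) / (num_iter - 1))

time_start = datetime.now()
y = st.norm.pdf(x)  # one vectorized call instead of 100000 scalar calls
print(datetime.now() - time_start)

time_start = datetime.now()
y = st.norm.cdf(x)  # likewise a single call down into ndtr
print(datetime.now() - time_start)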
Upvotes: 3