Reputation: 7404
I'm interested in visualizing a loglikelihood function of two variables.
Here is a little code to do that:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
#Generate fake data
x = np.random.normal(0,1,size = 100)
y = 2*x+1 + np.random.normal(size = x.size)
#Create a grid to visualize the log-likelihood
grid = np.linspace(-5,5,101)
B0,B1 = np.meshgrid(grid,grid)
#Compute the log likelihood
LogLik = 0
for xs,ys in zip(x,y):
LogLik+= norm.logpdf(ys, loc = B0+B1*xs)
plt.contourf(B0,B1,LogLik)
The bottleneck in this little piece of code is the computation of the log likelihood, namely
for xs,ys in zip(x,y):
LogLik+= norm.logpdf(ys, loc = B0+B1*xs)
If the length of x or y is very large, then this takes longer than I care for. Is there a way to vectorize the creation of the mean of the distribution (i.e. the B0+B1*xs
) as well as the evaluation of the logpdf
?
Upvotes: 1
Views: 76
Reputation: 5774
This can be vectorized easily by broadcasting newaxis to the arrays. As a result, the bottleneck norm.logpdf
will be executed only once:
log_lh = norm.logpdf(y, loc=B0[..., None] + B1[..., None] * x[None, None, :]).sum(axis=2)
# comparison with LogLik:
np.allclose(LogLik, log_lh)
# Out: True
Refactoring this into a function will allow to time the executions:
def loglik(x, y, B0, B1):
return norm.logpdf(y, loc=B0[..., None] + B1[..., None] * x[None, None, :]).sum(axis=2)
def loglik_loop(x, y, B0, B1):
LogLik = 0
for xs, ys in zip(x, y):
LogLik+= norm.logpdf(ys, loc=B0+B1*xs)
%timeit loglik(x, y, B0, B1)
# Out: 94.1 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loglik_loop(x, y, B0, B1)
# Out: 54 ms ± 4.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see, this seems to be one of the rare cases, where vectorizing the code does not improve the performance. There seems to be another bottleneck within the norm
module of scipy, which hampers the performance when operating on multi-dimensional arrays.
As a result, your only possibility to improve the performance of your code will be to implement a parallel execution of the loop (replacing the +=
operator with assigning to a fixed array and summing afterwards).
Upvotes: 1