Demetri Pananos

Reputation: 7404

Summing product of two arrays of different sizes

I'm interested in visualizing a loglikelihood function of two variables.

Here is a little code to do that:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

#Generate fake data
x = np.random.normal(0,1,size = 100)
y = 2*x+1 + np.random.normal(size = x.size)

#Create a grid to visualize the log-likelihood
grid = np.linspace(-5,5,101)
B0,B1 = np.meshgrid(grid,grid)

#Compute the log likelihood
LogLik = 0
for xs,ys in zip(x,y):
    LogLik+= norm.logpdf(ys, loc = B0+B1*xs)

plt.contourf(B0,B1,LogLik)

The bottleneck in this little piece of code is the computation of the log likelihood, namely

for xs,ys in zip(x,y):
    LogLik+= norm.logpdf(ys, loc = B0+B1*xs)

If the length of x or y is very large, then this takes longer than I care for. Is there a way to vectorize the creation of the mean of the distribution (i.e. the B0+B1*xs) as well as the evaluation of the logpdf?

Upvotes: 1

Views: 76

Answers (1)

JE_Muc

Reputation: 5774

This can be vectorized easily by adding new axes (np.newaxis, spelled None below) so the arrays broadcast against each other. As a result, the bottleneck norm.logpdf is executed only once:

log_lh = norm.logpdf(y, loc=B0[..., None] + B1[..., None] * x[None, None, :]).sum(axis=2)

# comparison with LogLik:
np.allclose(LogLik, log_lh)
# Out: True
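For intuition about the shapes involved: B0[..., None] and B1[..., None] have shape (101, 101, 1), while x[None, None, :] has shape (1, 1, 100), so the broadcast loc array has shape (101, 101, 100). A quick sketch checking this:

```python
import numpy as np

grid = np.linspace(-5, 5, 101)
B0, B1 = np.meshgrid(grid, grid)
x = np.zeros(100)  # stand-in for the data array

# np.broadcast_shapes reports the shape the loc array will take:
shape = np.broadcast_shapes(B0[..., None].shape, x[None, None, :].shape)
print(shape)  # (101, 101, 100)
```

Summing over axis=2 then collapses the per-observation terms back down to the (101, 101) grid.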

Refactoring both versions into functions makes it easy to time them:

def loglik(x, y, B0, B1):
    return norm.logpdf(y, loc=B0[..., None] + B1[..., None] * x[None, None, :]).sum(axis=2)

def loglik_loop(x, y, B0, B1):
    LogLik = 0
    for xs, ys in zip(x, y):
        LogLik += norm.logpdf(ys, loc=B0 + B1*xs)
    return LogLik

%timeit loglik(x, y, B0, B1)
# Out: 94.1 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loglik_loop(x, y, B0, B1)
# Out: 54 ms ± 4.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see, this seems to be one of the rare cases where vectorizing the code does not improve performance: scipy's norm machinery appears to carry per-call overhead of its own that dominates when it operates on large multi-dimensional arrays.

As a result, the main remaining option for speeding up your code is to parallelize the loop (replacing the += accumulation with writing each term into a preallocated array and summing afterwards).
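Alternatively, since the bottleneck appears to be scipy's overhead rather than the arithmetic itself, here is a sketch (my addition, not part of the original answer) that computes the standard-normal log-density directly with NumPy, bypassing norm.logpdf entirely:

```python
import numpy as np

def loglik_numpy(x, y, B0, B1):
    # log N(y | mu, sigma=1) = -0.5*log(2*pi) - 0.5*(y - mu)**2,
    # evaluated over the broadcast grid of means, then summed over data points
    mu = B0[..., None] + B1[..., None] * x        # shape (101, 101, len(x))
    return (-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2).sum(axis=-1)
```

This avoids scipy's argument checking and dispatch while producing the same values as norm.logpdf with the default scale of 1.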

Upvotes: 1
