Reputation: 101
I have two 2D numpy arrays A and B. I want to use scipy.stats.multivariate_normal to calculate the joint logpdf of each row in A, using the corresponding row in B as the covariance matrix. Is there some way of doing this without explicitly looping over rows? A straightforward application of scipy.stats.multivariate_normal to A and B does calculate the logpdf of each row in A (which is what I want), but it uses the entire 2D array B as a single covariance matrix, which is not what I want (I need each row of B to define a different covariance matrix). I am looking for a solution that uses numpy vectorization and avoids explicitly looping over both arrays.
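For reference, here is the explicit per-row loop I am trying to avoid (a minimal sketch; the zero means and the interpretation of each row of B as the diagonal of a covariance matrix are assumptions for illustration):

import numpy as np
from scipy.stats import multivariate_normal

n, d = 100, 3
A = np.random.uniform(-1, 1, (n, d))   # one data point per row
B = np.random.uniform(0.1, 2, (n, d))  # one covariance diagonal per row (assumption)

# One scipy call per row -- this is the loop I want to vectorize away.
logpdfs = np.array([
    multivariate_normal.logpdf(a, mean=np.zeros(d), cov=np.diag(b))
    for a, b in zip(A, B)
])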
Upvotes: 4
Views: 2769
Reputation: 527
I was also trying to accomplish something similar. Here's my code, which takes three NxD matrices: each row of X is a data point, each row of means is a mean vector, and each row of covariances is the diagonal of a diagonal covariance matrix. The result is a length-N vector of log probabilities.
import numpy as np

def vectorized_gaussian_logpdf(X, means, covariances):
    """
    Compute log N(x_i; mu_i, sigma_i) for each x_i, mu_i, sigma_i.

    Args:
        X : shape (n, d)
            Data points
        means : shape (n, d)
            Mean vectors
        covariances : shape (n, d)
            Diagonal covariance matrices
    Returns:
        logpdfs : shape (n,)
            Log probabilities
    """
    _, d = X.shape
    constant = d * np.log(2 * np.pi)
    # For a diagonal covariance, the determinant is the product of the diagonal
    # entries. (np.sum(np.log(covariances), axis=1) would be more robust
    # against under/overflow of the product for large d.)
    log_determinants = np.log(np.prod(covariances, axis=1))
    deviations = X - means
    # The element-wise reciprocal inverts a diagonal matrix.
    inverses = 1 / covariances
    return -0.5 * (constant + log_determinants +
                   np.sum(deviations * inverses * deviations, axis=1))
Note that this code only works for diagonal covariance matrices. In this special case the general log-density,

    log N(x; mu, Sigma) = -1/2 * (d*log(2*pi) + log|Sigma| + (x - mu)^T Sigma^{-1} (x - mu)),

simplifies: the determinant becomes a product over the diagonal elements, the inverse becomes the element-wise reciprocal, and the matrix multiplication becomes element-wise multiplication.
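If you need full (non-diagonal) covariance matrices, numpy's linear-algebra routines broadcast over a leading batch axis, so a similar vectorization is possible. A minimal sketch, assuming the covariances are stacked into an (n, d, d) array (the function name here is my own):

def batched_gaussian_logpdf(X, means, covs):
    # X, means: (n, d); covs: (n, d, d) stack of full covariance matrices.
    _, d = X.shape
    deviations = X - means
    # slogdet and solve both broadcast over the leading batch dimension.
    _, log_determinants = np.linalg.slogdet(covs)          # (n,)
    solved = np.linalg.solve(covs, deviations[..., None])  # (n, d, 1)
    mahalanobis = np.einsum('nd,nd->n', deviations, solved[..., 0])
    return -0.5 * (d * np.log(2 * np.pi) + log_determinants + mahalanobis)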
A quick test for correctness and running time:
import time
import scipy.stats

def test_vectorized_gaussian_logpdf():
    n = 128**2
    d = 64
    means = np.random.uniform(-1, 1, (n, d))
    covariances = np.random.uniform(0, 2, (n, d))
    X = np.random.uniform(-1, 1, (n, d))

    # Reference: one scipy call per row. scipy interprets a 1-D cov
    # argument as the diagonal of a covariance matrix.
    refs = []
    ref_start = time.time()
    for x, mean, covariance in zip(X, means, covariances):
        refs.append(scipy.stats.multivariate_normal.logpdf(x, mean, covariance))
    ref_time = time.time() - ref_start

    fast_start = time.time()
    results = vectorized_gaussian_logpdf(X, means, covariances)
    fast_time = time.time() - fast_start

    print("Reference time:", ref_time)
    print("Vectorized time:", fast_time)
    print("Speedup:", ref_time / fast_time)
    assert np.allclose(results, refs)
I get about a 250x speedup. (And yes, my application requires me to calculate 16384 different Gaussians.)
Upvotes: 4