Reputation: 448
Suppose I have two vectors and wish to take their dot product; this is simple,
import numpy as np
a = np.random.rand(3)
b = np.random.rand(3)
result = np.dot(a,b)
If I have stacks of vectors and I want each one dotted, the most naive code is
# 5 = number of vectors
a = np.random.rand(5,3)
b = np.random.rand(5,3)
result = [np.dot(aa,bb) for aa, bb in zip(a,b)]
Two ways to batch this computation are using a multiply and sum, and einsum,
result = np.sum(a*b, axis=1)
# or
result = np.einsum('ij,ij->i', a, b)
However, neither of these dispatches to the BLAS backend, and so each uses only a single core. This is not super great when N is very large, say 1 million.
tensordot does dispatch to the BLAS backend. A terrible way to do this computation with tensordot is
np.diag(np.tensordot(a, b, axes=[1,1]))
This is terrible because it allocates an N*N matrix, and most of its elements are wasted work.
Another (brilliantly fast) approach is the hidden inner1d function
from numpy.core.umath_tests import inner1d
result = inner1d(a,b)
but it seems this isn't going to be viable, since the issue that might export it publicly has gone stale. And this still boils down to writing the loop in C, instead of using multiple cores.
Is there a way to get dot, matmul, or tensordot to do all these dot products at once, on multiple cores?
Upvotes: 8
Views: 1983
Reputation: 50478
First of all, there is no direct BLAS function to do that. Using many level-1 BLAS function calls is not very efficient, since using multiple threads for a very short computation tends to introduce a pretty big overhead, while not using multiple threads may be sub-optimal. Still, such a computation is mainly memory-bound, so it scales poorly on platforms with many cores (a few cores are often enough to saturate the memory bandwidth).
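For illustration, the "many level-1 BLAS calls" approach could look like the sketch below, using SciPy's ddot wrapper. It is correct, but each call only dots a length-3 vector, so the per-call overhead dominates:
from scipy.linalg.blas import ddot

# One level-1 BLAS call per row: each call touches only 3 elements,
# so the call overhead dwarfs the actual arithmetic.
result = np.array([ddot(a[i], b[i]) for i in range(a.shape[0])])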
One simple solution is to use the Numexpr package which should do that quite efficiently (it should avoid the creation of temporary arrays and should also use multiple threads). However, the performance is somewhat disappointing for big arrays in this case.
The best solution appears to be Numba (or Cython). Numba can generate fast code for both small and big input arrays, and it is easy to parallelize the code. However, please note that managing threads introduces an overhead which can be quite big for small arrays (up to a few ms on some many-core platforms).
Here is a Numexpr implementation:
import numexpr as ne
expr = ne.NumExpr('sum(a * b, axis=1)')
result = expr.run(a, b)
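As a small aside, the number of threads Numexpr uses can be capped or tuned with ne.set_num_threads if needed:
ne.set_num_threads(4)  # e.g. limit Numexpr to 4 threads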
Here is a (sequential) Numba implementation:
import numba as nb

# Use `parallel=True` for a parallel implementation
@nb.njit('float64[:](float64[:,::1], float64[:,::1])')
def multiDots(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    # Use `nb.prange` instead of `range` to run the loop in parallel
    for i in range(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s
    return res

result = multiDots(a, b)
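For completeness, here is a sketch of the parallel variant the comments above describe (multiDotsParallel is just an illustrative name; the same contiguous float64 inputs are assumed, and the thread-management overhead mentioned earlier applies):
# Parallel variant: same kernel, but the outer loop uses nb.prange
@nb.njit('float64[:](float64[:,::1], float64[:,::1])', parallel=True)
def multiDotsParallel(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s
    return res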
Here are some benchmarks on a (old) 2-core machine:
On small 5x3 arrays:
np.einsum('ij,ij->i', a, b, optimize=True): 45.2 us
Numba (parallel): 12.1 us
np.sum(a*b, axis=1): 9.5 us
np.einsum('ij,ij->i', a, b): 6.5 us
Numexpr: 3.2 us
inner1d(a, b): 1.8 us
Numba (sequential): 1.3 us
On big 1000000x3 arrays:
np.sum(a*b, axis=1): 27.8 ms
Numexpr: 15.3 ms
np.einsum('ij,ij->i', a, b, optimize=True): 9.0 ms
np.einsum('ij,ij->i', a, b): 8.8 ms
Numba (sequential): 6.8 ms
inner1d(a, b): 6.5 ms
Numba (parallel): 5.3 ms
The sequential Numba implementation gives a good trade-off. You can use a switch between the sequential and parallel versions if you really want the best performance, though choosing the best n threshold in a platform-independent way is not so easy (see the sketch below).
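A minimal sketch of such a switch, assuming the sequential and parallel functions above and a hypothetical, machine-tuned PAR_THRESHOLD:
# Hypothetical threshold: the right value depends on the machine and must be tuned.
PAR_THRESHOLD = 50_000

def multiDotsAuto(a, b):
    # Only pay the thread-management overhead when there is enough work.
    if a.shape[0] >= PAR_THRESHOLD:
        return multiDotsParallel(a, b)
    return multiDots(a, b)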
Upvotes: 5