Reputation: 448
Suppose I have two vectors and wish to take their dot product; this is simple,
import numpy as np
a = np.random.rand(3)
b = np.random.rand(3)
result = np.dot(a,b)
If I have stacks of vectors and I want each one dotted, the most naive code is
# 5 = number of vectors
a = np.random.rand(5,3)
b = np.random.rand(5,3)
result = [np.dot(aa,bb) for aa, bb in zip(a,b)]
Two ways to batch this computation are using a multiply and sum, and einsum,
result = np.sum(a*b, axis=1)
# or
result = np.einsum('ij,ij->i', a, b)
However, neither of these dispatches to the BLAS backend, and so each uses only a single core. This is not super great when N is very large, say 1 million.
tensordot does dispatch to the BLAS backend. A terrible way to do this computation with tensordot is
np.diag(np.tensordot(a, b, axes=[1,1]))
This is terrible because it allocates an N*N matrix, and most of its elements are wasted work.
Another (brilliantly fast) approach is the hidden inner1d function
from numpy.core.umath_tests import inner1d
result = inner1d(a,b)
but it seems this isn't going to be viable, since the issue that might export it publicly has gone stale. And this still boils down to writing the loop in C, instead of using multiple cores.
Is there a way to get dot, matmul, or tensordot to do all these dot products at once, on multiple cores?
Upvotes: 8
Views: 1983
Reputation: 50478
First of all, there is no direct BLAS function to do that. Using many level-1 BLAS function calls is not very efficient, since using multiple threads for a very short computation tends to introduce a pretty big overhead, while not using multiple threads may be sub-optimal. Still, such a computation is mainly memory-bound, so it scales poorly on platforms with many cores (a few cores are often enough to saturate the memory bandwidth).
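For illustration, the "many level-1 BLAS calls" approach could look like the sketch below, using SciPy's ddot wrapper. It is correct, but each call only dots a length-3 vector, so the per-call overhead dominates:
from scipy.linalg.blas import ddot

# One level-1 BLAS call per row: each call touches only 3 elements,
# so the call overhead dwarfs the actual arithmetic.
result = np.array([ddot(a[i], b[i]) for i in range(a.shape[0])])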
One simple solution is to use the Numexpr package which should do that quite efficiently (it should avoid the creation of temporary arrays and should also use multiple threads). However, the performance is somewhat disappointing for big arrays in this case.
The best solution appears to be Numba (or Cython). Numba can generate fast code for both small and big input arrays, and it is easy to parallelize the code. However, please note that managing threads introduces an overhead which can be quite big for small arrays (up to a few ms on some many-core platforms).
Here is a Numexpr implementation:
import numexpr as ne
expr = ne.NumExpr('sum(a * b, axis=1)')
result = expr.run(a, b)
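As a small aside, the number of threads Numexpr uses can be capped or tuned with ne.set_num_threads if needed:
ne.set_num_threads(4)  # e.g. limit Numexpr to 4 threads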
Here is a (sequential) Numba implementation:
import numba as nb

# Use `parallel=True` for a parallel implementation
@nb.njit('float64[:](float64[:,::1], float64[:,::1])')
def multiDots(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    # Use `nb.prange` instead of `range` to run the loop in parallel
    for i in range(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s
    return res

result = multiDots(a, b)
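For completeness, here is a sketch of the parallel variant the comments above describe (multiDotsParallel is just an illustrative name; the same contiguous float64 inputs are assumed, and the thread-management overhead mentioned earlier applies):
# Parallel variant: same kernel, but the outer loop uses nb.prange
@nb.njit('float64[:](float64[:,::1], float64[:,::1])', parallel=True)
def multiDotsParallel(a, b):
    assert a.shape == b.shape
    n, m = a.shape
    res = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        for j in range(m):
            s += a[i,j] * b[i,j]
        res[i] = s
    return res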
Here are some benchmarks on a (old) 2-core machine:
On small 5x3 arrays:
np.einsum('ij,ij->i', a, b, optimize=True): 45.2 us
Numba (parallel): 12.1 us
np.sum(a*b, axis=1): 9.5 us
np.einsum('ij,ij->i', a, b): 6.5 us
Numexpr: 3.2 us
inner1d(a, b): 1.8 us
Numba (sequential): 1.3 us
On big 1000000x3 arrays:
np.sum(a*b, axis=1): 27.8 ms
Numexpr: 15.3 ms
np.einsum('ij,ij->i', a, b, optimize=True): 9.0 ms
np.einsum('ij,ij->i', a, b): 8.8 ms
Numba (sequential): 6.8 ms
inner1d(a, b): 6.5 ms
Numba (parallel): 5.3 ms
The sequential Numba implementation gives a good trade-off. You can use a switch between the sequential and parallel versions if you really want the best performance, though choosing the best n threshold in a platform-independent way is not so easy (see the sketch below).
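A minimal sketch of such a switch, assuming the sequential and parallel functions above and a hypothetical, machine-tuned PAR_THRESHOLD:
# Hypothetical threshold: the right value depends on the machine and must be tuned.
PAR_THRESHOLD = 50_000

def multiDotsAuto(a, b):
    # Only pay the thread-management overhead when there is enough work.
    if a.shape[0] >= PAR_THRESHOLD:
        return multiDotsParallel(a, b)
    return multiDots(a, b)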
Upvotes: 5