Reputation: 838
I have written a numpy program which is very time consuming. After profiling it, I found that most of the time is spent in numpy.einsum.
Although numpy is a wrapper around LAPACK/BLAS, I don't know whether numpy.einsum's performance is comparable to its counterparts in LAPACK or BLAS.
So, will I get a significant performance improvement if I switch to Fortran or C?
Upvotes: 5
Views: 2504
Reputation: 5834
Numpy wraps BLAS only for the primitive operations that BLAS itself specifies. That includes dot, innerproduct, vdot, matmul (new in 1.10), and the functions that depend on them (tensordot etc.). einsum, on the other hand, only calls BLAS for operations that can fall back to it (as of Numpy 1.14.0).
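For what it's worth, a minimal sketch of that fallback, assuming it is tied to einsum's optimize= keyword (which is how I understand the 1.14.0 behaviour):
import numpy as np

x = np.random.rand(200, 200)
y = np.random.rand(200, 200)

# A plain matrix product written as an einsum; with optimize enabled,
# recent Numpy versions may dispatch this contraction to BLAS.
z = np.einsum('ij,jk->ik', x, y, optimize=True)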
If your problem can be decomposed into several BLAS operations, then I suggest you try that first in Numpy itself. It may require some temporary arrays in between (that would still be the case even if you wrote C/FORTRAN that uses BLAS). You can eliminate some of the array-creation overhead by using the out= parameter of those functions.
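For example, a small sketch with made-up arrays (the names here are purely illustrative):
import numpy as np

a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
c = np.empty((100, 100))

# Write the product straight into the preallocated array instead of
# letting dot allocate a fresh result on every call.
np.dot(a, b, out=c)

Reusing c across many iterations avoids repeated allocations in a hot loop.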
But most of the time, you're using einsum precisely because the operation is not expressible in BLAS. Consider a simple example:
import numpy as np

a = np.arange(60.).reshape(3, 4, 5)
b = np.arange(24.).reshape(4, 3, 2)

# Contract over i and j: c[k, l] = sum_ij a[i, j, k] * b[j, i, l]
c = np.einsum('ijk,jil->kl', a, b)
To express the above in primitive operations, you need to swap the first two axes of b, do an element-wise multiplication over the first two dimensions, and then sum the result, for each pair of indices k and l.
c2 = np.empty((5, 2))
b2 = np.swapaxes(b, 0, 1)  # shape (3, 4, 2), so the leading axes line up with a

def manualeinsum(c2, a, b2):
    ny, nx = c2.shape
    for k in range(ny):
        for l in range(nx):
            # Element-wise product over the leading (i, j) axes, then reduce
            c2[k, l] = np.sum(a[..., k] * b2[..., l])

manualeinsum(c2, a, b2)
You can't express that in BLAS. UPDATE: The above problem can in fact be rewritten as a matrix multiplication, which BLAS can accelerate; see the comment by @ali_m. For large enough arrays, the BLAS approach is faster.
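A sketch of that reshaping trick (my own reconstruction of the idea, not @ali_m's exact code): flatten the shared i, j axes of both operands so a single BLAS-backed matrix product performs the whole contraction.
# Line up the (i, j) axes, flatten them, and contract with one GEMM:
# c[k, l] = sum_ij a[i, j, k] * b[j, i, l]
a2 = a.reshape(-1, 5)                 # shape (i*j, k)
b3 = b.swapaxes(0, 1).reshape(-1, 2)  # shape (i*j, l)
c3 = a2.T.dot(b3)                     # shape (k, l), computed by BLAS

assert np.allclose(c3, np.einsum('ijk,jil->kl', a, b))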
Meanwhile, note that einsum itself is written in C: it creates a dimension-specific iterator for the given indices, and it is also optimized for SSE.
Upvotes: 4