atbug
atbug

Reputation: 838

Is numpy.einsum efficient compared to fortran or C?

I have written a numpy program which is very time consuming. After profiling it, I found that most of the time is spent in numpy.einsum.

Although numpy is a wrapper of LAPACK or BLAS, I don't know whether numpy.einsum's performance is comparable to its counterpart in LAPACK or BLAS.

So, will I get much performance enhancement if I switch to fortran or C?

Upvotes: 5

Views: 2504

Answers (1)

syockit
syockit

Reputation: 5834

Numpy wraps with BLAS only for primitive operations specified with BLAS. That includes dot, innerproduct, vdot, matmul (new in 1.10), and functions that depend on it (tensordot etc.). einsum, on the other hand, only calls BLAS for operations that allow to fall back to it (as of Numpy 1.14.0).

If your problem can be decomposed into several BLAS operations, then I suggest you try that first in Numpy itself. It may require some temporary arrays in between (it would still be the case even if you were to write C/FORTRAN that uses BLAS). You can eliminate certain array creation overhead by using the out= parameter of the functions.

But most of the time, you're using einsum because it's not expressible in BLAS. See a simple example:

a = np.arange(60.).reshape(3,4,5)
b = np.arange(24.).reshape(4,3,2)
c = np.einsum('ijk,jil->kl', a, b)

To express the above in primitive operations, you need to swap the first two axes in b, do a element-wise multiplication for the first two dimensions, then sum them up, for each index k and l.

c2 = np.ndarray((5, 2))
b2 = np.swapaxes(b, 0, 1)
def manualeinsum(c2, a, b):
    ny, nx = c2.shape
    for k in range(ny):
        for l in range(nx):
            c2[k, l] = np.sum(a[..., k]*b2[...,l])
manualeinsum(c2, a, b2)

You can't BLAS that. UPDATE: The above problem can be expressed as matrix multiplication which can be accelerated using BLAS. See comment by @ali_m. For large enough arrays, BLAS approach is faster.

Meanwhile, note that einsum itself is written in C, creates a dimension-specific iterator for the indices given, and is also optimized for SSE.

Upvotes: 4

Related Questions