nobe

Reputation: 61

Is there a way to compute only the real part of a NumPy matmul?

Let's say I have two arrays a and b, both with dtype np.complex128, and I want to compute C = np.matmul(a, b).real.

That is, I don't care about the imaginary part, only the real part. Is there a better way to do this? It seems like such a method would provide a decent speedup considering that only around half of the floating point multiplications are actually required.
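
For reference, the identity I have in mind is Re(AB) = Re(A)Re(B) - Im(A)Im(B), which needs only two of the four real matrix products hidden in a complex matmul. A quick check of that identity (a minimal sketch, not part of any timing code):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
b = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# Re(a @ b) == Re(a) @ Re(b) - Im(a) @ Im(b)
assert np.allclose(np.matmul(a, b).real, a.real @ b.real - a.imag @ b.imag)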

As a bonus question, what I actually want to do is compute the real part of the einsum np.einsum("ab,bc,cd,da->a", ...).real with rather large operands, so hopefully methods for the basic question are generalizable to einsum.

Upvotes: 6

Views: 300

Answers (2)

Jérôme Richard

Reputation: 50836

When the arrays have a size >1000, the best Numpy-only implementation I found is the following:

A2 = A.view(np.float64)  # shape (m, 2n): real/imag interleaved along each row
B2 = np.array([B.real, -B.imag]).transpose(1, 0, 2).reshape(B.shape[0]*2, B.shape[1])
out = A2 @ B2            # equals np.matmul(A, B).real

The idea is to transform only one of the arrays instead of both (compared to the other answer so far) so as to reduce the overhead. The first matrix, A, can be reinterpreted as a np.float64 matrix in which the real and imaginary values are interleaved. For the operation to work, we thus need to interleave the rows of the second matrix, B, so that the first row of each pair contains the real values and the second contains the negated imaginary values. This enables the underlying BLAS routine to compute the expression [email protected] - [email protected] directly and efficiently, without any additional temporary arrays.
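
A quick sanity check of the layout trick on small matrices (assuming C-contiguous inputs, since the view requires a contiguous last axis):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

A2 = A.view(np.float64)  # shape (3, 8): [re, im, re, im, ...] per row
B2 = np.array([B.real, -B.imag]).transpose(1, 0, 2).reshape(B.shape[0] * 2, B.shape[1])

assert np.allclose(A2 @ B2, (A @ B).real)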

The problem is that this pure-Numpy code is sub-optimal, mainly because Numpy creates one temporary array for -B.imag and another for [B.real, -B.imag], and also because the transposition trick itself is sub-optimal. We can use Numba to make this operation faster by avoiding the two temporary arrays and performing the specialized transposition directly. We can also benefit from multiple threads so that the whole operation runs in parallel. Here is the resulting code:

import numpy as np
import numba as nb

@nb.njit('(complex128[:,::1], complex128[:,::1])', parallel=True)
def fastest_matmul(a, b):
    bn, bm = b.shape
    # Interleave the rows of b: even rows hold the real parts,
    # odd rows the negated imaginary parts (same layout trick as above).
    b2 = np.empty((bn*2, bm), dtype=np.float64)
    for i in nb.prange(bn):
        for j in range(bm):
            b2[i*2, j] = b[i, j].real
            b2[i*2+1, j] = -b[i, j].imag
    # a.view(np.float64) interleaves real/imag along each row, matching b2.
    return a.view(np.float64) @ b2
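
For reference, a small usage check (the explicit signature restricts the inputs to C-contiguous 2D complex128 arrays):

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500)) + 1j * rng.standard_normal((500, 500))
B = rng.standard_normal((500, 500)) + 1j * rng.standard_normal((500, 500))
assert np.allclose(fastest_matmul(A, B), (A @ B).real)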

For even better performance in a computing loop where the matrices have the same size, the output array can be preallocated and pre-filled once with the following code:

# Done only once.
# Note it is NOT equivalent to `np.zeros` performance-wise: the latter does not
# physically fill the pages with zeros, causing page faults later.
# See https://stackoverflow.com/a/76368471/12939557 for more information.
out = np.full((A.shape[0], B.shape[1]), 0, np.float64)
B2 = np.full((B.shape[0]*2, B.shape[1]), 0, np.float64)

@nb.njit('(complex128[:,::1], float64[:,::1])', parallel=True)
def prepare(b, b2):
    bn, bm = b.shape
    for i in nb.prange(bn):
        for j in range(bm):
            b2[i*2, j] = b[i, j].real
            b2[i*2+1, j] = -b[i, j].imag

prepare(B, B2)
np.matmul(A.view(np.float64), B2, out)
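
In the computing loop itself, only the cheap prepare pass and the BLAS call then repeat, with no new allocation. A sketch of that pattern (n_steps, new_A and new_B are hypothetical placeholders for however the operands are produced):

for step in range(n_steps):      # n_steps: hypothetical iteration count
    A[...] = new_A(step)         # hypothetical: refresh operands in place
    B[...] = new_B(step)
    prepare(B, B2)               # rewrite B2 without any new allocation
    np.matmul(A.view(np.float64), B2, out)  # result written into `out`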

Benchmark

Here are results on my i5-9600KF CPU (6 cores) with 1500x1500 matrices:

ThomasIsCoding's option 1:              21652   ms  ± 59 ms
Naive solution:                           102.1 ms  ± 2.71 ms
ThomasIsCoding's option 3:                 90.0 ms  ± 1.69 ms
hpaulj's suggestion:                       88.8 ms  ± 1.86 ms
ThomasIsCoding's option 2:                 87.1 ms  ± 1.71 ms
Pure-Numpy proposed solution:              79.3 ms  ± 0.95 ms
Numba proposed solution (out-of-place):    60.9 ms  ± 1.17 ms    <-----
Numba proposed solution (in-place):        56.2 ms  ± 1.11 ms    <-----
Approx optimal time:                       52.8 ms  ± 1.45 ms

The optimal time is the time to compute np.matmul(A2, B2, out), where A2 and B2 are the arrays defined previously and out is a preallocated/prefilled array. This time is close to the naive time divided by two, which makes sense since the naive solution requires twice as many FLOPs: a complex multiply-accumulate costs four real multiplications, while the real part alone needs only two.

We can see that the Numba solution is much faster than the other solutions, especially when the output is preallocated/prefilled. The two proposed Numba solutions are respectively 67% and 81% faster than the naive solution (the in-place one being 55% faster than ThomasIsCoding's fastest option). The in-place Numba code is very close to the optimal solution.


Note and Discussion

About 5% of the in-place Numba solution is spent in the prepare function, and this function does not scale with more cores because the operation is mostly memory-bound. A way to make the whole thing faster would be to write a specific BLAS implementation requiring no temporary array at all, but this is very hard to do. Indeed, to really compete with the Numba implementation (assuming a fast BLAS is used by the Numpy/Numba code), one needs to write an implementation as optimized as the fastest BLAS ones, but for complex numbers. To see how hard it is, you can read this post. The fastest BLAS implementations use aggressively-optimized, platform-dependent, manually-written assembly code that takes care of unrolling loops, prefetching data, and operating on memory tiles (possibly even multiple levels of tiles to account for the multiple levels of cache a modern CPU can have). For example, when I tried, I only succeeded in reaching at best 85% of the performance of OpenBLAS (which is very close to optimal on my machine), using pretty big, aggressively-optimized native code (working only on x86 CPUs having the AVX2 and FMA instruction sets) and not hand-written assembly.

Upvotes: 3

ThomasIsCoding

Reputation: 102529

I think you can try Option 2 and Option 3 (a quick correctness check of all three options is shown after the list)

  • Option 1 (Slower, as verified in the comments)
out = [email protected] - [email protected]
  • Option 2 (Faster)
a = np.asarray([A.real,A.imag])
b = np.asarray([B.real,B.imag])
c = a@b
out = c[0,:,:]-c[1,:,:]
  • Option 3 (Faster)
a = np.column_stack([A.real, A.imag])
b = np.vstack([B.real, -B.imag])
out = a@b
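
Both options implement the identity Re(AB) = Re(A)Re(B) - Im(A)Im(B): Option 2 as a batched matmul over a stacked leading axis, Option 3 by folding the subtraction into one larger real matmul. A quick check that all three agree with the baseline:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
ref = (A @ B).real

# Option 2: batched matmul; c[0] = A.real @ B.real, c[1] = A.imag @ B.imag
c = np.asarray([A.real, A.imag]) @ np.asarray([B.real, B.imag])
assert np.allclose(c[0] - c[1], ref)

# Option 3: [Ar | Ai] @ [[Br], [-Bi]] computes the subtraction in one product
out3 = np.column_stack([A.real, A.imag]) @ np.vstack([B.real, -B.imag])
assert np.allclose(out3, ref)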

Example

Given data

np.random.seed(0)
m,n,p = 4000, 3000, 5000
A = np.random.rand(m, n) + np.random.rand(m, n) * 1j
B = np.random.rand(n, p) + np.random.rand(n, p) * 1j

you will see

# OP's original solution (baseline)
%%timeit
out = (A@B).real
# 3.36 s ± 232 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# hpaulj's suggestion (see comment)
%%timeit
out = A.real.copy()@B.real.copy()-A.imag.copy()@B.imag.copy()
# 1.93 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Option 2
%%timeit
a = np.asarray([A.real,A.imag])
b = np.asarray([B.real,B.imag])
c = a@b
out = c[0,:,:]-c[1,:,:]
# 1.62 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Option 3
%%timeit
a = np.column_stack([A.real, A.imag])
b = np.vstack([B.real, -B.imag])
out = a@b
# 1.68 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 2
