Reputation: 61
Let's say I have two arrays a and b, both with dtype np.complex128, and I want to compute C = np.matmul(a, b).real.
That is, I don't care about the imaginary part, only the real part. Is there a better way to do this? It seems like such a method would provide a decent speedup considering that only around half of the floating point multiplications are actually required.
As a bonus question, what I actually want to do is compute np.einsum("ab,bc,cd,da->a", ...).real with rather large operands, so hopefully methods for the basic question generalize to einsum.
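For context, the identity behind the answers below is Re(AB) = Re(A)Re(B) - Im(A)Im(B): the imaginary cross-terms only contribute to the imaginary part of the product. A minimal check on small random matrices (shapes and seed are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4)) + 1j * rng.random((3, 4))
B = rng.random((4, 5)) + 1j * rng.random((4, 5))

# Re(A @ B) = A.real @ B.real - A.imag @ B.imag
naive = (A @ B).real
split = A.real @ B.real - A.imag @ B.imag

print(np.allclose(naive, split))  # True
```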
Upvotes: 6
Views: 300
Reputation: 50836
When the arrays have a size >1000, the best Numpy-only implementation I found is the following:
A2 = A.view(np.float64)
B2 = np.array([B.real, -B.imag]).transpose(1, 0, 2).reshape(B.shape[0]*2, B.shape[1])
out = A2 @ B2
The idea is to transform only one of the arrays instead of both (compared to the other solutions so far), so as to reduce the overhead. The first matrix, A, can be reinterpreted as a np.float64 matrix in which the real and imaginary values are interleaved. For the operation to work, we thus need to interleave the rows of the second matrix, B, so that even rows contain the real values and odd rows contain the negated imaginary values. This enables the underlying BLAS operation to directly compute the expression A.real@B.real - A.imag@B.imag efficiently without any additional temporary arrays.
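A quick correctness check of this interleaving trick on small matrices (my own shapes and seed, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3)) + 1j * rng.random((4, 3))
B = rng.random((3, 5)) + 1j * rng.random((3, 5))

# Reinterpret A as float64: each complex entry becomes a (real, imag)
# pair, so A2 has shape (4, 6) with interleaved columns.
A2 = A.view(np.float64)
# Interleave the rows of B: row 2k holds B.real[k], row 2k+1 holds -B.imag[k],
# so each (real, imag) column pair of A2 meets the matching row pair of B2.
B2 = np.array([B.real, -B.imag]).transpose(1, 0, 2).reshape(B.shape[0] * 2, B.shape[1])

print(np.allclose(A2 @ B2, (A @ B).real))  # True
```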
The thing is, the pure-Numpy code is sub-optimal, mainly because Numpy creates a temporary array for -B.imag, another for [B.real, -B.imag], and also because the transposition trick is simply sub-optimal. We can use Numba to make this operation faster by avoiding the two temporary arrays and performing a specialized kind of transposition. We can also benefit from using multiple threads so the whole operation runs completely in parallel. Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(complex128[:,::1], complex128[:,::1])', parallel=True)
def fastest_matmul(a, b):
    bn, bm = b.shape
    b2 = np.empty((bn*2, bm), dtype=np.float64)
    for i in nb.prange(bn):
        for j in range(bm):
            b2[i*2, j] = b[i, j].real
            b2[i*2+1, j] = -b[i, j].imag
    return a.view(np.float64) @ b2
For even better performance in a computing loop where the matrices always have the same size, the output array can be preallocated and pre-filled once with the following code:
# Done only once
# Note it is NOT equivalent to `np.zeros` performance-wise: the latter does not
# physically fill the pages with zeros, causing page-faults later.
# See https://stackoverflow.com/a/76368471/12939557 for more information.
out = np.full((A.shape[0], B.shape[1]), 0, np.float64)
B2 = np.full((B.shape[0]*2, B.shape[1]), 0, np.float64)

@nb.njit('(complex128[:,::1], float64[:,::1])', parallel=True)
def prepare(b, b2):
    bn, bm = b.shape
    for i in nb.prange(bn):
        for j in range(bm):
            b2[i*2, j] = b[i, j].real
            b2[i*2+1, j] = -b[i, j].imag

# Done for each multiplication
prepare(B, B2)
np.matmul(A.view(np.float64), B2, out)
Here are results on my i5-9600KF CPU (6 cores) with 1500x1500 matrices:
ThomasIsCoding's option 1: 21652 ms ± 59 ms
Naive solution: 102.1 ms ± 2.71 ms
ThomasIsCoding's option 3: 90.0 ms ± 1.69 ms
hpaulj's suggestion: 88.8 ms ± 1.86 ms
ThomasIsCoding's option 2: 87.1 ms ± 1.71 ms
Pure-Numpy proposed solution: 79.3 ms ± 0.95 ms
Numba proposed solution (out-of-place): 60.9 ms ± 1.17 ms <-----
Numba proposed solution (in-place): 56.2 ms ± 1.11 ms <-----
Approx optimal time: 52.8 ms ± 1.45 ms
The optimal time is the time to compute np.matmul(A2, B2, out), where A2 and B2 are the ones defined previously and out is a preallocated/prefilled array. This time is close to half the naive time, which makes sense since the naive solution requires twice as many FLOPs.
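The factor-of-two estimate can be sanity-checked with a quick multiplication count (a complex multiply costs 4 real multiplies; the trick replaces it with a real matmul of doubled inner dimension):

```python
# Rough real-multiplication counts for an (m x n) @ (n x p) matmul.
m, n, p = 1500, 1500, 1500

# Naive complex matmul: 4 real multiplies per complex multiply.
naive_mults = 4 * m * n * p
# Real-part-only trick: one (m x 2n) @ (2n x p) real matmul.
trick_mults = m * (2 * n) * p

print(naive_mults / trick_mults)  # 2.0
```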
We can see that the Numba solution is much faster than the other solutions, especially when the output is preallocated/prefilled. The two proposed Numba solutions are respectively 67% and 81% faster than the naive solution (and up to 55% faster than ThomasIsCoding's fastest solution). The in-place Numba code is very close to the optimal solution.
About 5% of the in-place Numba solution's time is spent in the prepare function, and this function does not scale because it is mostly memory-bound. A way to make this faster would be to write a specific BLAS implementation requiring no temporary array at all, but this is very hard to do. Indeed, one needs to write an implementation as optimized as the fastest BLAS ones, but for complex numbers, to really compete with the Numba implementation (assuming a fast BLAS is used in the Numpy/Numba code). To see how hard it is, you can read this post. The fastest BLAS implementations use aggressively-optimized, platform-dependent, manually-written assembly code that takes care of unrolling loops, prefetching data, and operating on memory tiles (possibly even multiple levels of tiles, to match the multiple levels of cache a modern CPU can have). For example, when I tried, I only managed to reach 85% of the performance of OpenBLAS (which is very close to optimal on my machine) at best, using a pretty big, aggressively-optimized native code (working only on x86 CPUs with the AVX2 and FMA instruction sets), and not hand-written assembly.
Upvotes: 3
Reputation: 102529
I think you can try Option 2 and Option 3 below, both of which compute

out = A.real@B.real - A.imag@B.imag

Option 2:

a = np.asarray([A.real, A.imag])
b = np.asarray([B.real, B.imag])
c = a@b
out = c[0,:,:]-c[1,:,:]

Option 3:

a = np.column_stack([A.real, A.imag])
b = np.vstack([B.real, -B.imag])
out = a@b
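Both options can be checked against the baseline on small matrices (my own shapes and seed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3)) + 1j * rng.random((4, 3))
B = rng.random((3, 5)) + 1j * rng.random((3, 5))

baseline = (A @ B).real

# Option 2: stack real/imag parts along a leading batch axis, so one
# batched matmul yields A.real@B.real and A.imag@B.imag.
a = np.asarray([A.real, A.imag])
b = np.asarray([B.real, B.imag])
c = a @ b
out2 = c[0, :, :] - c[1, :, :]

# Option 3: widen A horizontally and B vertically, so a single real
# matmul computes the difference directly.
out3 = np.column_stack([A.real, A.imag]) @ np.vstack([B.real, -B.imag])

print(np.allclose(out2, baseline), np.allclose(out3, baseline))  # True True
```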
Given data
np.random.seed(0)
m,n,p = 4000, 3000, 5000
A = np.random.rand(m, n) + np.random.rand(m, n) * 1j
B = np.random.rand(n, p) + np.random.rand(n, p) * 1j
you will see
# OP's original solution (baseline)
%%timeit
out = (A@B).real
# 3.36 s ± 232 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# hpaulj's suggestion (see comment)
%%timeit
out = A.real.copy()@B.real.copy()-A.imag.copy()@B.imag.copy()
# 1.93 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Option 2
%%timeit
a = np.asarray([A.real,A.imag])
b = np.asarray([B.real,B.imag])
c = a@b
out = c[0,:,:]-c[1,:,:]
# 1.62 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Option 3
%%timeit
a = np.column_stack([A.real, A.imag])
b = np.vstack([B.real, -B.imag])
out = a@b
# 1.68 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 2