Why one code (matmul) is faster than the other (Python)

Question

The problem deals with basic matrix operation. In the following code, c1 essentially equals c2. However, the first way of computing is much faster than the second one. In fact, at first I thought the first way needs to allocate a b matrix that is twice larger than the a matrix, hence may be slower. It turns out to be the opposite. Why?

import time
import numpy as np
a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j

tic = time.time()
b = np.vstack((a.real,a.imag))
c1 = b.T @ b
t1 = time.time()-tic

tic = time.time()
c2 = a.real.T @ a.real+a.imag.T@a.imag
t2 = time.time()-tic

print('t1=%f. t2=%f.'%(t1,t2))

An example result is

t1=0.037965. t2=4.375873.

Ehsan · Accepted Answer

a.real and a.imag are in place accesses while np.vstack creates a new copy. The way @ operator (matmul()) deals with a.real and a.imag takes longer time. To make it faster, you can create a copy of each and then pass it to @ or use np.dot(a.real.T, a.real) and np.dot(a.imag.T, a.imag) (I am not sure of each implementation in BLAS).

For large matrices, the first method in the following code should still be slightly faster:

a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j

tic = time.time()
b = np.vstack((a.real,a.imag))
c1 = b.T @ b
t1 = time.time()-tic

tic = time.time()
b = a.real.copy()
c = a.imag.copy()
c2 = b.T @ b + c.T @ c
t2 = time.time()-tic

print('t1=%f. t2=%f.'%(t1,t2))

Output:

t1=0.031620. t2=0.021769.

EDIT: Dive deeper:

Let's take a quick look at various memory layouts:

a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j
print('a.flags
', a.flags)
print('a.real flags
', a.real.flags)


a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
...
a.real flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False

So a is C_CONTIGUOUS and a.real is not. I am not sure how @ implements the calculations, but my guess is it is different with cache tricks and strides and unrolled loops. I will leave that to experts to explain. Now, array.copy() by default is C_CONTIGUOUS (BE CAREFUL: np.copy() is not C_CONTIGUOUS by default.) and that is why the second approach above is as fast as first one (where b is also C_CONTIGUOUS).

@George C: I think the first one is slightly faster because np.vstack creates a new C_CONTIGUOUS object that can leverage cache tricks in one place, while in the second approach the output of a.real.T @ a.real and a.imag.T@a.imag are at different locations of memory and require extra effort to calculate. Here is a link to more explanation.
Disclaimer: Any expert that knows the details of NumPy implementation are welcome to edit this post.

Why one code (matmul) is faster than the other (Python)

Answers (1)

Related Questions