Reputation: 153
The problem deals with basic matrix operations. In the following code, c1 is mathematically equal to c2, yet the first way of computing it is much faster than the second. In fact, at first I expected the first way to be slower, since it allocates a matrix b twice as large as a. It turns out to be the opposite. Why?
import time
import numpy as np
a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j
tic = time.time()
b = np.vstack((a.real,a.imag))
c1 = b.T @ b
t1 = time.time()-tic
tic = time.time()
c2 = a.real.T @ a.real + a.imag.T @ a.imag
t2 = time.time()-tic
print('t1=%f. t2=%f.'%(t1,t2))
An example result is
t1=0.037965. t2=4.375873.
Upvotes: 5
Views: 375
Reputation: 12407
a.real and a.imag are views into the complex array (no data is copied), while np.vstack creates a new contiguous copy. The way the @ operator (matmul()) handles the strided a.real and a.imag views takes longer. To make it faster, you can create a copy of each and then pass it to @, or use np.dot(a.real.T, a.real) and np.dot(a.imag.T, a.imag) (I am not sure of each implementation in BLAS).
Once explicit copies are made, the second approach becomes about as fast as the first (for very large matrices the first may still hold a slight edge):
a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j
tic = time.time()
b = np.vstack((a.real,a.imag))
c1 = b.T @ b
t1 = time.time()-tic
tic = time.time()
b = a.real.copy()
c = a.imag.copy()
c2 = b.T @ b + c.T @ c
t2 = time.time()-tic
print('t1=%f. t2=%f.'%(t1,t2))
Output:
t1=0.031620. t2=0.021769.
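Why the two approaches agree: stacking b = [Re(a); Im(a)] vertically means b.T @ b = Re(a).T @ Re(a) + Im(a).T @ Im(a). A minimal sketch to confirm this numerically on a small array (the variable names here are my own):

```python
import numpy as np

# Small complex test array
rng = np.random.default_rng(0)
a = rng.random((50, 4)) + rng.random((50, 4)) * 1j

# Stacked approach: b = [Re(a); Im(a)], so b.T @ b sums both blocks
b = np.vstack((a.real, a.imag))
c1 = b.T @ b

# Separate approach on contiguous copies of the real/imag views
c2 = a.real.copy().T @ a.real.copy() + a.imag.copy().T @ a.imag.copy()

print(np.allclose(c1, c2))  # True: the two results agree
```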
EDIT: Diving deeper, let's take a quick look at the various memory layouts:
a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j
print('a.flags\n', a.flags)
print('a.real flags\n', a.real.flags)
a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
...
a.real flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
So a is C_CONTIGUOUS while a.real is not. I am not sure exactly how @ implements the calculation, but my guess is that contiguity changes the cache behavior, the strides, and the loop unrolling. I will leave the details to the experts. Now, array.copy() is C_CONTIGUOUS by default (BE CAREFUL: np.copy() is not C_CONTIGUOUS by default — it uses order='K', which tries to match the source layout), and that is why the second approach above is as fast as the first one (where b is also C_CONTIGUOUS).
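As an alternative to calling .copy() on each view, np.ascontiguousarray also returns a C-contiguous copy. A quick sketch checking the flags (array sizes here are arbitrary):

```python
import numpy as np

a = np.random.rand(1000, 100) + np.random.rand(1000, 100) * 1j

# a.real is a strided view into the complex buffer: not contiguous
print(a.real.flags['C_CONTIGUOUS'])   # False

# Both of these produce a C-contiguous copy suitable for fast matmul
br = a.real.copy()                    # ndarray.copy() defaults to order='C'
bi = np.ascontiguousarray(a.imag)     # copies only if not already contiguous

print(br.flags['C_CONTIGUOUS'], bi.flags['C_CONTIGUOUS'])  # True True
```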
@George C: I think the first one is slightly faster because np.vstack creates a single new C_CONTIGUOUS object whose product can leverage cache locality in one place, while in the second approach the outputs of a.real.T @ a.real and a.imag.T @ a.imag land at different locations in memory and require extra effort to combine. Here is a link to more explanation.
Disclaimer: Any expert who knows the details of the NumPy implementation is welcome to edit this post.
Upvotes: 2