Reputation: 2401
I'm trying to optimize some code that performs lots of sequential matrix operations.
I figured numpy.linalg.multi_dot
(docs here) would perform all the operations in C or BLAS, and thus be way faster than doing something like arr1.dot(arr2).dot(arr3)
and so on.
I was really surprised running this code on a notebook:
v1 = np.random.rand(2,2)
v2 = np.random.rand(2,2)
%%timeit
v1.dot(v2.dot(v1.dot(v2)))
The slowest run took 9.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.14 µs per loop
%%timeit
np.linalg.multi_dot([v1,v2,v1,v2])
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.9 µs per loop
only to find that the same operation is about 10x slower using multi_dot.
My question is: why is multi_dot so much slower here?
Upvotes: 1
Views: 2761
Reputation: 2012
It's because your test matrices are too small and too regular; the overhead of figuring out the fastest evaluation order can outweigh the potential performance gain.
Using the example from the documentation:
import numpy as np
from numpy.linalg import multi_dot
# Prepare some data
A = np.random.rand(10000, 100)
B = np.random.rand(100, 1000)
C = np.random.rand(1000, 5)
D = np.random.rand(5, 333)
%timeit -n 10 multi_dot([A, B, C, D])
%timeit -n 10 np.dot(np.dot(np.dot(A, B), C), D)
%timeit -n 10 A.dot(B).dot(C).dot(D)
Result:
10 loops, best of 3: 12 ms per loop
10 loops, best of 3: 62.7 ms per loop
10 loops, best of 3: 59 ms per loop
multi_dot improves performance by choosing the multiplication order with the fewest scalar multiplications.
In the above case, instead of the default left-to-right order ((AB)C)D, the product is evaluated as A((BC)D),
so the expensive 10000x100 @ 100x1000 multiplication is replaced by a 10000x100 @ 100x333 one,
cutting down at least 2/3 of the scalar multiplications.
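As a rough illustration (this is just a hand count, not multi_dot's actual dynamic-programming implementation), you can tally the scalar multiplications of the two orders for the shapes above:

```python
def matmul_cost(m, k, n):
    """Scalar multiplications needed for an (m x k) @ (k x n) product."""
    return m * k * n

# ((A B) C) D -- naive left-to-right order
left_to_right = (
    matmul_cost(10000, 100, 1000)   # A @ B        -> 10000 x 1000
    + matmul_cost(10000, 1000, 5)   # (AB) @ C     -> 10000 x 5
    + matmul_cost(10000, 5, 333)    # ((AB)C) @ D  -> 10000 x 333
)

# A ((B C) D) -- the order multi_dot picks here
reordered = (
    matmul_cost(100, 1000, 5)       # B @ C        -> 100 x 5
    + matmul_cost(100, 5, 333)      # (BC) @ D     -> 100 x 333
    + matmul_cost(10000, 100, 333)  # A @ ((BC)D)  -> 10000 x 333
)

print(left_to_right)  # -> 1066650000
print(reordered)      # -> 333666500
```

About a 3.2x difference in raw work, which lines up with the timings.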
You can verify this by testing
%timeit -n 10 np.dot(A, np.dot(np.dot(B, C), D))
10 loops, best of 3: 19.2 ms per loop
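And as a quick sanity check (a sketch; np.allclose is used because different orders accumulate floating-point rounding differently), all parenthesizations compute the same product, so the speedup comes purely from the evaluation order:

```python
import numpy as np
from numpy.linalg import multi_dot

rng = np.random.default_rng(0)
A = rng.random((10000, 100))
B = rng.random((100, 1000))
C = rng.random((1000, 5))
D = rng.random((5, 333))

# Same result up to rounding, regardless of multiplication order
assert np.allclose(multi_dot([A, B, C, D]),
                   np.dot(np.dot(np.dot(A, B), C), D))
```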
Upvotes: 8