Reputation: 321
I'm relatively new to NumPy and often read that you should avoid writing loops. In many cases I understand how to do that, but at the moment I have the following code:
import numpy as np

p = np.arange(15).reshape(5,3)
w = np.random.rand(5)
A = np.sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))
Does anybody know if there is a way to avoid the for loop?
Thanks in advance!
Upvotes: 4
Views: 263
Reputation: 221634
Approach #1 : With np.einsum -

np.einsum('ij,ik,i->jk',p,p,w)

Approach #2 : With broadcasting + np.tensordot -

np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))

Approach #3 : With np.einsum + np.dot -

np.einsum('ij,i->ji',p,w).dot(p)
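As a quick sanity check, here's a minimal sketch (assuming small random inputs and the same array names as above) that compares all three one-liners against a plain loop-based reference with np.allclose:

import numpy as np

p = np.random.rand(50, 30)
w = np.random.rand(50)

# Loop-based reference: weighted sum of outer products
ref = sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))

out1 = np.einsum('ij,ik,i->jk', p, p, w)                             # Approach #1
out2 = np.tensordot(p[..., None] * p[:, None], w, axes=((0), (0)))   # Approach #2
out3 = np.einsum('ij,i->ji', p, w).dot(p)                            # Approach #3

print(np.allclose(ref, out1), np.allclose(ref, out2), np.allclose(ref, out3))
# Expected output: True True True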
Runtime test -

Set #1 :
In [653]: p = np.random.rand(50,30)
In [654]: w = np.random.rand(50)
In [655]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10000 loops, best of 3: 101 µs per loop
In [656]: %timeit np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
10000 loops, best of 3: 124 µs per loop
In [657]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
100000 loops, best of 3: 9.07 µs per loop
Set #2 :
In [658]: p = np.random.rand(500,300)
In [659]: w = np.random.rand(500)
In [660]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10 loops, best of 3: 139 ms per loop
In [661]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
1000 loops, best of 3: 1.01 ms per loop
The third approach just blew everything else away!
Why is Approach #3 10x-130x faster than Approach #1?
np.einsum is implemented in C. In the first approach, with the three indices i, j, k in its subscript string, we would have three nested loops (in C, of course). That's a lot of memory overhead there.

With the third approach, the einsum call only involves two indices, i and j, hence two nested loops (in C again), and it also leverages BLAS-based matrix multiplication with that np.dot. These two factors are responsible for the amazing speedup with this one.
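To make that concrete, the einsum call in Approach #3 just scales each row of p by the corresponding weight and transposes, so the whole computation can also be written as basic broadcasting followed by a single BLAS-backed matrix product; a minimal sketch assuming the same shapes as Set #2:

import numpy as np

p = np.random.rand(500, 300)
w = np.random.rand(500)

# Scale each row of p by its weight, then transpose: equivalent to
# np.einsum('ij,i->ji', p, w)
scaled = (p * w[:, None]).T          # shape (300, 500)

# A single (300, 500) x (500, 300) matrix product, dispatched to BLAS
out = scaled.dot(p)                  # shape (300, 300)

print(np.allclose(out, np.einsum('ij,i->ji', p, w).dot(p)))  # True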
Upvotes: 6