Reputation: 25033
I'm building a matrix of distances, and eventually I produced the following code:
In [83]: import numpy as np
In [84]: np.set_printoptions(linewidth=120,precision=2)
In [85]: n = 7 ; a = np.arange(n) ; o = np.ones(n) ; np.sqrt(np.outer(o,a*a)+np.outer(a*a,o))
Out[85]:
array([[ 0. , 1. , 2. , 3. , 4. , 5. , 6. ],
[ 1. , 1.41, 2.24, 3.16, 4.12, 5.1 , 6.08],
[ 2. , 2.24, 2.83, 3.61, 4.47, 5.39, 6.32],
[ 3. , 3.16, 3.61, 4.24, 5. , 5.83, 6.71],
[ 4. , 4.12, 4.47, 5. , 5.66, 6.4 , 7.21],
[ 5. , 5.1 , 5.39, 5.83, 6.4 , 7.07, 7.81],
[ 6. , 6.08, 6.32, 6.71, 7.21, 7.81, 8.49]])
I told myself, "You're wasting an outer product, you fool! Save one of them and use the transpose!" So I wrote
In [86]: n = 7 ; a = np.outer(np.arange(n)**2, np.ones(n)) ; np.sqrt(a+a.T)
Out[86]:
array([[ 0. , 1. , 2. , 3. , 4. , 5. , 6. ],
[ 1. , 1.41, 2.24, 3.16, 4.12, 5.1 , 6.08],
[ 2. , 2.24, 2.83, 3.61, 4.47, 5.39, 6.32],
[ 3. , 3.16, 3.61, 4.24, 5. , 5.83, 6.71],
[ 4. , 4.12, 4.47, 5. , 5.66, 6.4 , 7.21],
[ 5. , 5.1 , 5.39, 5.83, 6.4 , 7.07, 7.81],
[ 6. , 6.08, 6.32, 6.71, 7.21, 7.81, 8.49]])
So far, so good: I had two (slightly) different implementations of the same idea, and one of them was obviously faster than the other, wasn't it?
In [87]: %timeit n = 1001 ; a = np.arange(n) ; o = np.ones(n) ; np.sqrt(np.outer(o,a*a)+np.outer(a*a,o))
100 loops, best of 3: 13.7 ms per loop
In [88]: %timeit n = 1001 ; a = np.outer(np.arange(n)**2, np.ones(n)) ; np.sqrt(a+a.T)
10 loops, best of 3: 19.7 ms per loop
No! The supposedly faster implementation is over 40% slower (19.7 ms versus 13.7 ms)!
I'm surprised by the behavior I've just discovered. Am I wrong to be surprised? In other words, what is the rationale behind the different timings?
Upvotes: 4
Views: 860
Reputation: 231385
Here are some timings with the small n=7:
In [784]: timeit np.outer(o,a*a)
10000 loops, best of 3: 24.2 µs per loop
In [785]: timeit np.outer(a*a,o)
10000 loops, best of 3: 25.7 µs per loop
In [786]: timeit np.outer(a*a,o)+np.outer(o,a*a)
10000 loops, best of 3: 52.7 µs per loop
The 2 outers take about the same time, and summing them takes a bit more than their combined time.
In [787]: timeit a2=np.outer(a*a,o); a2+a2.T
10000 loops, best of 3: 33.2 µs per loop
In [788]: timeit a2=np.outer(a*a,o); a2+a2
10000 loops, best of 3: 27.9 µs per loop
In [795]: timeit a2=np.outer(a*a,o); a2.T+a2.T
10000 loops, best of 3: 29.4 µs per loop
Comparing these three, we see that adding a2.T to a2 is slower than adding a2 to itself, or even a2.T to itself. Performing the transpose is cheap, just a matter of changing shape and strides. But the iteration over the mixed strides is slower. The iterator may even use a temporary buffer.
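The point about shape and strides can be checked directly. This is a minimal sketch that reuses the names a, o and a2 from the timings above and assumes the default float64 result of np.outer; it only inspects the array layout and does not measure anything.

```python
import numpy as np

n = 7
a = np.arange(n)
o = np.ones(n)
a2 = np.outer(a * a, o)  # C-contiguous (n, n) float64 array

# The transpose is a view on the same buffer: only the strides are swapped.
print(a2.strides)                   # (56, 8): 7 * 8 bytes per row, 8 bytes per item
print(a2.T.strides)                 # (8, 56)
print(np.shares_memory(a2, a2.T))   # True

# a2 is laid out row by row, a2.T column by column, so a2 + a2.T has to
# walk the two operands in different memory orders.
print(a2.flags['C_CONTIGUOUS'], a2.T.flags['C_CONTIGUOUS'])  # True False
print(a2.T.flags['F_CONTIGUOUS'])                            # True
```

In a2 + a2 and a2.T + a2.T both operands share one layout, which is consistent with those two sums being the faster ones in the timings.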
So in my timings, precomputing the outer saves some time, but not as much as one might expect.
For large n, the summation of the 2 (n,n) arrays takes about the same time as generating them. So the relative advantage of precomputing the outer is reduced.
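To reproduce the large-n comparison outside IPython, a small harness along these lines should work. The helper name best and the number/repeat values are arbitrary choices for this sketch, not part of the original timings, and the numbers it prints will vary with the machine and the NumPy version.

```python
import timeit
import numpy as np

n = 1001
a2 = np.outer(np.arange(n) ** 2, np.ones(n))

def best(func, number=20, repeat=3):
    """Best average time per call in seconds (hypothetical helper)."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

t_outer = best(lambda: np.outer(np.arange(n) ** 2, np.ones(n)))  # build one (n, n) array
t_same = best(lambda: a2 + a2)     # both operands share the same layout
t_mixed = best(lambda: a2 + a2.T)  # row-major plus column-major operand

print('outer  : %.2f ms' % (t_outer * 1e3))
print('a2+a2  : %.2f ms' % (t_same * 1e3))
print('a2+a2.T: %.2f ms' % (t_mixed * 1e3))
```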
Previous comparison of outer and a*a.T omitted.
Upvotes: 1
Reputation: 4017
Funny: when I run your example, I get the opposite results:
In [7]: %timeit n = 1001 ; a = np.arange(n) ; o = np.ones(n) ; np.sqrt(np.outer(o,a*a)+np.outer(a*a,o))
100 loops, best of 3: 17.2 ms per loop
In [8]: %timeit n = 1001 ; a = np.outer(np.arange(n)**2, np.ones(n)) ; np.sqrt(a+a.T)
100 loops, best of 3: 12.8 ms per loop
But this is the fastest and simplest way I could think of:
In [139]: %timeit n = 1001 ; a = np.arange(n); np.sqrt((a**2)[:, np.newaxis]+a**2)
100 loops, best of 3: 10.8 ms per loop
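As a sanity check, the three formulations seen so far can be compared on the small n = 7 case. This is only a sketch; the variable names two_outers, with_transpose and broadcast are made up for the comparison.

```python
import numpy as np

n = 7
a = np.arange(n)
o = np.ones(n)

# Version 1: two outer products.
two_outers = np.sqrt(np.outer(o, a * a) + np.outer(a * a, o))

# Version 2: one outer product plus its transpose.
sq = np.outer(a ** 2, o)
with_transpose = np.sqrt(sq + sq.T)

# Version 3: an (n, 1) column plus an (n,) row broadcasts to (n, n),
# so no explicit outer product is built at all.
broadcast = np.sqrt((a ** 2)[:, np.newaxis] + a ** 2)

print(np.allclose(two_outers, with_transpose))  # True
print(np.allclose(two_outers, broadcast))       # True
print(broadcast.shape)                          # (7, 7)
```

The broadcasting version never materialises the intermediate (n, n) outer-product arrays before the sum, which is one plausible reason it comes out fastest here.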
As an aside, if you are working with distances, you might find the scipy.spatial.distance module and the scipy.spatial.distance_matrix function useful.
Upvotes: 1
Reputation: 23753
Refactoring your code to reuse a and o, I get the opposite:
from timeit import Timer
import numpy as np

n = 1001
a = np.arange(n)
o = np.ones(n)

def g(a, o):
    # two outer products
    return np.sqrt(np.outer(o, a*a) + np.outer(a*a, o))

def f(a, o):
    # one outer product plus its transpose
    a = np.outer(a**2, o)
    return np.sqrt(a + a.T)

# both versions must produce the same matrix
assert np.all(f(a, o) == g(a, o))

t = Timer('g(a, o)', 'from __main__ import a, o, np, f, g')
print('g:', t.timeit(100)/100)  # g: 0.0166591598767
t = Timer('f(a, o)', 'from __main__ import a, o, np, f, g')
print('f:', t.timeit(100)/100)  # f: 0.0200494056252
Upvotes: 1