Reputation: 25033

Is numpy.outer() really faster than transposition?

I'm building a matrix of distances and eventually produced the following code:

In [83]: import numpy as np

In [84]: np.set_printoptions(linewidth=120,precision=2)

In [85]: n = 7 ; a = np.arange(n) ; o = np.ones(n) ; np.sqrt(np.outer(o,a*a)+np.outer(a*a,o))
Out[85]: 
array([[ 0.  ,  1.  ,  2.  ,  3.  ,  4.  ,  5.  ,  6.  ],
       [ 1.  ,  1.41,  2.24,  3.16,  4.12,  5.1 ,  6.08],
       [ 2.  ,  2.24,  2.83,  3.61,  4.47,  5.39,  6.32],
       [ 3.  ,  3.16,  3.61,  4.24,  5.  ,  5.83,  6.71],
       [ 4.  ,  4.12,  4.47,  5.  ,  5.66,  6.4 ,  7.21],
       [ 5.  ,  5.1 ,  5.39,  5.83,  6.4 ,  7.07,  7.81],
       [ 6.  ,  6.08,  6.32,  6.71,  7.21,  7.81,  8.49]])

I told myself "You're wasting an outer product, you fool! Save one of them and use the transpose!", so I wrote

In [86]: n = 7 ; a = np.outer(np.arange(n)**2, np.ones(n)) ; np.sqrt(a+a.T)
Out[86]: 
array([[ 0.  ,  1.  ,  2.  ,  3.  ,  4.  ,  5.  ,  6.  ],
       [ 1.  ,  1.41,  2.24,  3.16,  4.12,  5.1 ,  6.08],
       [ 2.  ,  2.24,  2.83,  3.61,  4.47,  5.39,  6.32],
       [ 3.  ,  3.16,  3.61,  4.24,  5.  ,  5.83,  6.71],
       [ 4.  ,  4.12,  4.47,  5.  ,  5.66,  6.4 ,  7.21],
       [ 5.  ,  5.1 ,  5.39,  5.83,  6.4 ,  7.07,  7.81],
       [ 6.  ,  6.08,  6.32,  6.71,  7.21,  7.81,  8.49]])

So far, so good. I had two (slightly) different implementations of the same idea, one of them obviously faster than the other, right?

In [87]: %timeit n = 1001 ; a = np.arange(n) ; o = np.ones(n) ; np.sqrt(np.outer(o,a*a)+np.outer(a*a,o))
100 loops, best of 3: 13.7 ms per loop

In [88]: %timeit n = 1001 ; a = np.outer(np.arange(n)**2, np.ones(n)) ; np.sqrt(a+a.T)
10 loops, best of 3: 19.7 ms per loop

No! The supposedly faster implementation is about 44% slower (19.7 ms versus 13.7 ms)!

Question

I'm surprised by the behavior I've just discovered. Am I wrong to be surprised? In other words, what explains the difference in timings?

Upvotes: 4

Answers (3)

hpaulj

Reputation: 231385

Here are some timings with the small n=7:

In [784]: timeit np.outer(o,a*a)
10000 loops, best of 3: 24.2 µs per loop

In [785]: timeit np.outer(a*a,o)
10000 loops, best of 3: 25.7 µs per loop

In [786]: timeit np.outer(a*a,o)+np.outer(o,a*a)
10000 loops, best of 3: 52.7 µs per loop

The 2 outers take about the same time, and computing their sum takes a bit more than their combined time.

In [787]: timeit a2=np.outer(a*a,o); a2+a2.T
10000 loops, best of 3: 33.2 µs per loop

In [788]: timeit a2=np.outer(a*a,o); a2+a2
10000 loops, best of 3: 27.9 µs per loop

In [795]: timeit a2=np.outer(a*a,o); a2.T+a2.T
10000 loops, best of 3: 29.4 µs per loop

Comparing these three, we see that adding a2.T to a2 is slower than adding a2 to itself, or even a2.T to itself. Performing the transpose is cheap: it only swaps the shape and strides and copies no data. But iterating over two operands with mixed strides is slower, and the iterator may even use a temporary buffer.
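To make the "shape and strides" point concrete, here is a minimal sketch (not part of the original timings) that inspects a2 and its transpose. It assumes the same a, o and a2 as above, with n = 7 and a float64 result from np.outer.

```python
import numpy as np

n = 7
a = np.arange(n)
o = np.ones(n)

a2 = np.outer(a * a, o)   # fresh C-contiguous (n, n) float64 array
a2t = a2.T                # transpose: a view on the same data, nothing is copied

# The transpose shares its buffer with the original array.
shares_memory = np.shares_memory(a2, a2t)

# Only the strides are swapped: (56, 8) for a2, (8, 56) for a2.T
# (8 bytes per float64, 7 elements per row).
strides_a2 = a2.strides
strides_a2t = a2t.strides

# a2 is row-major (C order); its transpose is not.
c_contig = (a2.flags['C_CONTIGUOUS'], a2t.flags['C_CONTIGUOUS'])

print(shares_memory, strides_a2, strides_a2t, c_contig)
```

So a2 + a2.T has to walk one operand along rows and the other along columns, whereas a2 + a2 and a2.T + a2.T each walk both operands in the same memory order.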

So in my timings, precomputing the outer saves some time, but not as much as one might expect.

For large n, the summation of the 2 (n,n) arrays takes about the same time as generating them. So the relative advantage of precomputing the outer is reduced.
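A rough way to check this on your own machine is sketched below. The helper best_ms is a hypothetical name introduced here for illustration, and n = 1001 is taken from the question. The numbers depend heavily on hardware and NumPy version, so no expected output is given.

```python
import timeit
import numpy as np

n = 1001
a = np.arange(n)
o = np.ones(n)
a2 = np.outer(a * a, o)

def best_ms(func, repeat=3, number=10):
    """Hypothetical helper: best per-call time in milliseconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3

timings = {
    "outer(a*a, o)": best_ms(lambda: np.outer(a * a, o)),  # generating one array
    "a2 + a2": best_ms(lambda: a2 + a2),                    # same-stride sum
    "a2 + a2.T": best_ms(lambda: a2 + a2.T),                # mixed-stride sum
}
for label, ms in timings.items():
    print(f"{label:>14}: {ms:.3f} ms")
```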

Previous comparison of outer and a*a.T omitted.

Upvotes: 1

Ramon Crehuet

Reputation: 4017

It's funny, but when I run your example I get the opposite results:

In [7]: %timeit n = 1001 ; a = np.arange(n) ; o = np.ones(n) ; np.sqrt(np.outer(o,a*a)+np.outer(a*a,o))
100 loops, best of 3: 17.2 ms per loop

In [8]: %timeit n = 1001 ; a = np.outer(np.arange(n)**2, np.ones(n)) ; np.sqrt(a+a.T)
100 loops, best of 3: 12.8 ms per loop

But this is the fastest and simplest way I could think of:

In [139]: %timeit n = 1001 ; a = np.arange(n); np.sqrt((a**2)[:, np.newaxis]+a**2)
100 loops, best of 3: 10.8 ms per loop
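As a small sanity check (a sketch with n = 7 for readability, not a benchmark), the broadcasting version can be compared against the two-outer-product version from the question. The names sq, d_broadcast and d_outer are introduced here only for illustration.

```python
import numpy as np

n = 7
a = np.arange(n)
o = np.ones(n)
sq = a ** 2

# A column vector of shape (n, 1) plus a row vector of shape (n,) broadcasts
# to (n, n), so entry [i, j] is a[i]**2 + a[j]**2 without any outer product
# or transpose.
d_broadcast = np.sqrt(sq[:, np.newaxis] + sq)

# The two-outer-product version from the question, for comparison.
d_outer = np.sqrt(np.outer(o, a * a) + np.outer(a * a, o))

print(np.allclose(d_broadcast, d_outer))
print(d_broadcast.shape)
```

Broadcasting never materializes the two intermediate (n, n) arrays that np.outer builds, which is a plausible reason for it being the quickest of the three here.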

As an aside, if you are working with distances, you might find the scipy.spatial.distance module and the scipy.spatial.distance_matrix function useful.

Upvotes: 1

wwii

Reputation: 23753

Refactoring your code to reuse a and o, I get the opposite:

from timeit import Timer
import numpy as np

n = 1001
a = np.arange(n)
o = np.ones(n)

def g(a, o):
    # Two outer products, as in the question's first version.
    return np.sqrt(np.outer(o, a*a) + np.outer(a*a, o))

def f(a, o):
    # One outer product, reused through its transpose.
    a = np.outer(a**2, o)
    return np.sqrt(a + a.T)

# Both versions must produce the same matrix.
assert np.all(f(a, o) == g(a, o))

t = Timer('g(a, o)', 'from __main__ import a, o, np, f, g')
print('g:', t.timeit(100)/100)    # g: 0.0166591598767
t = Timer('f(a, o)', 'from __main__ import a, o, np, f, g')
print('f:', t.timeit(100)/100)    # f: 0.0200494056252

Upvotes: 1
