Reputation: 2194
I would like to do a 'daxpy' (add to a vector the scalar multiple of a second vector and assign the result to the first) with numpy using numba. Doing the following test, I noticed that writing the loop myself was much faster than doing a += c * b.
I was not expecting this. What is the reason for this behavior?
import numpy as np
from numba import jit

x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4

@jit(nopython=True)
def test1(a, b, c):
    a += c * b
    return a

@jit(nopython=True)
def test2(a, b, c):
    for i in range(len(a)):
        a[i] += c * b[i]
    return a
%timeit -n100 -r10 test1(x, o, c)
>>> 100 loops, best of 10: 2.48 ms per loop
%timeit -n100 -r10 test2(x, o, c)
>>> 100 loops, best of 10: 1.2 ms per loop
Upvotes: 6
Views: 498
Reputation: 52276
One thing to keep in mind is that 'manual looping' in numba is very fast, essentially the same as the C loop used by numpy operations.
In the first example there are two operations: a temporary array (c * b) is allocated and calculated, then that temporary array is added to a. In the second example, both calculations happen in a single loop with no intermediate result.
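To make the cost of that intermediate concrete, here is a plain-NumPy sketch (the function names are mine, purely for illustration). The first version mirrors test1 and allocates a fresh temporary for c * b; the second removes the allocation by writing into a preallocated scratch buffer via ufunc out= arguments. The explicit numba loop goes one step further and also fuses the multiply and add into a single pass over the data:

```python
import numpy as np

def daxpy_temp(a, b, c):
    a += c * b                       # allocates a temporary for c * b
    return a

def daxpy_noalloc(a, b, c, scratch):
    np.multiply(b, c, out=scratch)   # c * b written into scratch, no new array
    np.add(a, scratch, out=a)        # a += scratch, in place
    return a

a1 = np.ones(5)
a2 = np.ones(5)
b = np.arange(5.0)
scratch = np.empty_like(b)

daxpy_temp(a1, b, 2.0)
daxpy_noalloc(a2, b, 2.0, scratch)
print(np.allclose(a1, a2))           # True
```

Avoiding the allocation helps, but the data still makes two passes through memory; the fused loop in test2 makes only one, which is where the rest of the speedup comes from.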
In theory, numba could fuse the loops and optimize #1 to do the same as #2, but it doesn't seem to be doing so. If you just want to optimize numpy ops, numexpr may also be worth a look, as it was designed for exactly that - though it probably won't do any better than the explicit fused loop.
In [17]: import numexpr as ne
In [18]: %timeit -r10 test2(x, o, c)
1000 loops, best of 10: 1.36 ms per loop
In [19]: %timeit ne.evaluate('x + o * c', out=x)
1000 loops, best of 3: 1.43 ms per loop
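For completeness, a quick sanity check on small arrays that the numexpr expression with out=x really performs the same in-place daxpy as the explicit loop (variable names reuse the ones from the question):

```python
import numpy as np
import numexpr as ne

x = np.random.random(10)
o = np.random.random(10)
c = 3.4

expected = x + c * o               # reference result, computed up front
ne.evaluate('x + o * c', out=x)    # in-place daxpy; numexpr avoids the temporary
print(np.allclose(x, expected))    # True
```

numexpr evaluates the expression in blocked chunks, so like the fused loop it never materializes a full-size c * b array.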
Upvotes: 3