Reputation: 2194
I would like to do a 'daxpy' (add to a vector the scalar multiple of a second vector and assign the result to the first) with numpy using numba. Doing the following test, I noticed that writing the loop myself was much faster than doing a += c * b.
I was not expecting this. What is the reason for this behavior?
import numpy as np
from numba import jit

x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4

@jit(nopython=True)
def test1(a, b, c):
    a += c * b
    return a

@jit(nopython=True)
def test2(a, b, c):
    for i in range(len(a)):
        a[i] += c * b[i]
    return a
%timeit -n100 -r10 test1(x, o, c)
>>> 100 loops, best of 10: 2.48 ms per loop
%timeit -n100 -r10 test2(x, o, c)
>>> 100 loops, best of 10: 1.2 ms per loop
Upvotes: 6
Views: 498
Reputation: 52276
One thing to keep in mind is that 'manual looping' in numba is very fast, essentially the same as the C loop used by numpy operations.
In the first example there are two operations: a temporary array (c * b) is allocated and calculated, then that temporary array is added to a. In the second example, both calculations happen in a single loop with no intermediate result.
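To make the cost of that intermediate concrete, here is a plain-NumPy sketch (the function names are mine, purely for illustration). The first version mirrors test1 and allocates a fresh temporary for c * b; the second removes the allocation by writing into a preallocated scratch buffer via ufunc out= arguments. The explicit numba loop goes one step further and also fuses the multiply and add into a single pass over the data:

```python
import numpy as np

def daxpy_temp(a, b, c):
    a += c * b                       # allocates a temporary for c * b
    return a

def daxpy_noalloc(a, b, c, scratch):
    np.multiply(b, c, out=scratch)   # c * b written into scratch, no new array
    np.add(a, scratch, out=a)        # a += scratch, in place
    return a

a1 = np.ones(5)
a2 = np.ones(5)
b = np.arange(5.0)
scratch = np.empty_like(b)

daxpy_temp(a1, b, 2.0)
daxpy_noalloc(a2, b, 2.0, scratch)
print(np.allclose(a1, a2))           # True
```

Avoiding the allocation helps, but the data still makes two passes through memory; the fused loop in test2 makes only one, which is where the rest of the speedup comes from.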
In theory, numba could fuse the loops and optimize #1 to do the same as #2, but it doesn't seem to be doing so. If you just want to optimize numpy ops, numexpr may also be worth a look, as it was designed for exactly that - though it probably won't do any better than the explicit fused loop.
In [17]: import numexpr as ne
In [18]: %timeit -r10 test2(x, o, c)
1000 loops, best of 10: 1.36 ms per loop
In [19]: %timeit ne.evaluate('x + o * c', out=x)
1000 loops, best of 3: 1.43 ms per loop
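For completeness, a quick sanity check on small arrays that the numexpr expression with out=x really performs the same in-place daxpy as the explicit loop (variable names reuse the ones from the question):

```python
import numpy as np
import numexpr as ne

x = np.random.random(10)
o = np.random.random(10)
c = 3.4

expected = x + c * o               # reference result, computed up front
ne.evaluate('x + o * c', out=x)    # in-place daxpy; numexpr avoids the temporary
print(np.allclose(x, expected))    # True
```

numexpr evaluates the expression in blocked chunks, so like the fused loop it never materializes a full-size c * b array.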
Upvotes: 3