dan-man

Reputation: 2989

numpy ufunc/arithmetic performance - integer not using SSE?

Consider the following IPython performance test, where we create a pair of 10,000-element 32-bit vectors and add them, first using integer arithmetic and then using float arithmetic:

from numpy.random import randint
from numpy import int32, float32

a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)
%timeit a+b  # int32 addition, gives 20.6µs per loop

a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)
%timeit a+b  # float32 addition, gives 3.91µs per loop

Why is the floating point version about 5x faster?

If you do the same test with float64 it takes twice as long as float32, which is what you'd expect if we are fully utilizing the hardware. However, the timing for the integer case seems to be constant from int8 to int64. This, together with the 5x slowdown, makes me suspect that it is completely failing to use SSE.

For int32, I observe similar 20µs values when a+b is replaced by a & 0xff or a >> 2, suggesting that the problem is not limited to addition.
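A minimal sketch of that sweep (the array size and dtypes are the ones from the question; timings will differ by machine, and you can substitute a & 0xff or a >> 2 for a + b on the integer dtypes to reproduce the bitwise timings):

import timeit
import numpy as np

# On the affected build the integer timings stay roughly constant
# from int8 to int64, while the float timings scale with element
# width as SSE vectorization would predict.
for dtype in ('int8', 'int16', 'int32', 'int64', 'float32', 'float64'):
    a = np.random.randint(255, size=10000).astype(dtype)
    b = np.random.randint(255, size=10000).astype(dtype)
    t = timeit.timeit(lambda: a + b, number=10000)
    print('%-8s %6.2f µs per add' % (dtype, t / 10000 * 1e6))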

I'm using numpy 1.9.1, though unfortunately I can't remember whether I compiled it locally or downloaded a binary. Either way, this performance observation was pretty shocking to me. How is it possible that the version I have is so hopeless at integer arithmetic?

Edit: I've also tested on a similar, but separate PC, running numpy 1.8, which I'm fairly sure was straight from a PythonXY binary. I got the same results.

Question: Do other people see similar results, and if not, what can I do to get the same performance?

Update: I have created a new issue on numpy's github repo.

Upvotes: 1

Views: 866

Answers (2)

jtaylor

Reputation: 2434

The not-yet-released numpy 1.10 will also vectorize integer operations, if the compiler supports it. This was added in this change: https://github.com/numpy/numpy/pull/5144

E.g. your test case with the current git head, compiled with gcc 4.8, runs at the same speed for int and float, and the code produced looks decent:

  0.04 │27b:   movdqu (%rdx,%rax,1),%xmm0
 25.33 │       add    $0x1,%r10
       │       movdqu (%r8,%rax,1),%xmm1
       │       paddd  %xmm1,%xmm0
 23.17 │       movups %xmm0,(%rcx,%rax,1)
 34.72 │       add    $0x10,%rax
 16.05 │       cmp    %r10,%rsi
       │     ↑ ja     27b

Additional speedups can be achieved by using AVX2 if the CPU supports it (e.g. Intel Haswell), though currently that needs to be done by compiling with OPT="-O3 -mavx2"; there is no runtime detection for this in numpy yet.
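As a quick sanity check from Python (a sketch: the array size and best-of-three repeat count are arbitrary), you can compare the int32 and float32 timings directly; a ratio near 1 indicates a vectorizing build like the git head above:

import timeit
import numpy as np

n = 10000
ai = np.random.randint(255, size=n).astype(np.int32)
bi = np.random.randint(255, size=n).astype(np.int32)
af = ai.astype(np.float32)
bf = bi.astype(np.float32)

# Best of three runs to damp noise; the ~5x ratio from the question
# indicates scalar integer loops, a ratio near 1 indicates SIMD.
t_int = min(timeit.repeat(lambda: ai + bi, number=10000, repeat=3))
t_flt = min(timeit.repeat(lambda: af + bf, number=10000, repeat=3))
print('int32 / float32 time ratio: %.2f' % (t_int / t_flt))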

Upvotes: 2

Roland Smith

Reputation: 43573

On a modern CPU there are a lot of factors that influence performance. Whether the data is integer or floating point is only one of them.

Factors such as whether the data is in the cache or has to be fetched from RAM (or even worse from swap) will have a big impact.
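For example (a sketch; the three sizes are arbitrary, chosen to sit below, around, and above typical cache sizes), timing the same addition at different array lengths and normalizing per element makes the memory hierarchy visible:

import timeit
import numpy as np

# Per-element cost of a + b for arrays that fit in cache versus
# arrays that have to stream from RAM.
for n in (10**3, 10**5, 10**7):
    a = np.random.random(n).astype(np.float32)
    b = np.random.random(n).astype(np.float32)
    reps = max(1, 10**7 // n)
    t = timeit.timeit(lambda: a + b, number=reps)
    print('n=%-8d %.3f ns per element' % (n, t / reps / n * 1e9))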

The compiler used to build numpy also has a big influence: how good is it at using SIMD instructions like SSE? These can speed up array operations significantly.

The results for my system (Intel Core2 Quad Q9300):

In [1]: from numpy.random import randint

In [2]: from numpy import int32, float32, float64

In [3]: a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)

In [4]: %timeit a+b
100000 loops, best of 3: 12.9 µs per loop

In [5]: a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)

In [6]: %timeit a+b
100000 loops, best of 3: 8.25 µs per loop

In [7]: a, b = randint(255,size=10000).astype(float64), randint(255,size=10000).astype(float64)

In [8]: %timeit a+b
100000 loops, best of 3: 13.9 µs per loop

So on this machine there is no factor of five between int32 and float32, nor a factor of two between float32 and float64.

From the processor utilization I can see that the timeit loops use only one of the four available cores. This seems to confirm that these simple operations don't go through BLAS routines, since this numpy was built with a parallel OpenBLAS.

The way numpy was compiled also has a significant influence. Using objdump as suggested in the answers to this question, I could see that my numpy uses SSE2 instructions and the xmm registers.
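That check can be scripted; here is a sketch, assuming binutils' objdump is on the PATH and that the ufunc inner loops live in numpy's umath extension module:

import subprocess
import numpy.core.umath as umath

# Disassemble the compiled ufunc loops and look for a packed SSE2
# integer add (paddd); its absence suggests scalar integer code.
asm = subprocess.check_output(['objdump', '-d', umath.__file__])
print('paddd present' if b'paddd' in asm else 'paddd absent')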

In [9]: from numpy import show_config

In [10]: show_config()
atlas_threads_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_lapack_info:
  NOT AVAILABLE
blas_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
mkl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
blas_mkl_info:
  NOT AVAILABLE

If you want to see the effect of the BLAS that you use, run the following program with numpy compiled with different BLAS libraries.

from __future__ import print_function
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print('FAST BLAS')
except ImportError:
    print('slow blas')

print("version:", numpy.__version__)
print("maxint:", sys.maxsize)
print()

setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print("dot:", t.timeit(count)/count, "sec")

On my machine I get:

FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.06626860399264842 sec

Based on the results from this test I switched from ATLAS to OpenBLAS, because it was significantly faster on my machine.

Upvotes: 0
