dan-man

Reputation: 2989

numpy ufunc/arithmetic performance - integer not using SSE?

Consider the following IPython performance test, where we create a pair of 10,000-element 32-bit vectors and add them, first using integer arithmetic and then using float arithmetic:

from numpy.random import randint
from numpy import int32, float32

a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)
%timeit a+b  # int32 addition, gives 20.6µs per loop

a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)
%timeit a+b  # float32 addition, gives 3.91µs per loop

Why is the floating point version about 5x faster?

If you do the same test with float64 it takes twice as long as float32, which is what you'd expect if we are fully utilizing the hardware. However, the timing for the integer case seems to be constant from int8 to int64. This, together with the 5x slowdown, makes me suspect that it is completely failing to use SSE.

For int32, I observe similar 20µs values when a+b is replaced by a & 0xff or a >> 2, suggesting that the problem is not limited to addition.
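A minimal sketch of that sweep (the array size and dtypes are the ones from the question; timings will differ by machine, and you can substitute a & 0xff or a >> 2 for a + b on the integer dtypes to reproduce the bitwise timings):

import timeit
import numpy as np

# On the affected build the integer timings stay roughly constant
# from int8 to int64, while the float timings scale with element
# width as SSE vectorization would predict.
for dtype in ('int8', 'int16', 'int32', 'int64', 'float32', 'float64'):
    a = np.random.randint(255, size=10000).astype(dtype)
    b = np.random.randint(255, size=10000).astype(dtype)
    t = timeit.timeit(lambda: a + b, number=10000)
    print('%-8s %6.2f µs per add' % (dtype, t / 10000 * 1e6))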

I'm using numpy 1.9.1, though unfortunately I can't remember whether I compiled it locally or downloaded a binary. Either way, this performance observation was pretty shocking to me. How is it possible that the version I have is so hopeless at integer arithmetic?

Edit: I've also tested on a similar, but separate PC, running numpy 1.8, which I'm fairly sure was straight from a PythonXY binary. I got the same results.

Question: Do other people see similar results, and if not, what can I do to get the same performance?

Update: I have created a new issue on numpy's github repo.

Upvotes: 1

Views: 866

Answers (2)

jtaylor

Reputation: 2434

The not-yet-released numpy 1.10 will also vectorize integer operations, if the compiler supports it. This was added in this change: https://github.com/numpy/numpy/pull/5144

E.g. your test case with the current git head, compiled with gcc 4.8, runs at the same speed for int and float, and the code produced looks decent:

  0.04 │27b:   movdqu (%rdx,%rax,1),%xmm0
 25.33 │       add    $0x1,%r10
       │       movdqu (%r8,%rax,1),%xmm1
       │       paddd  %xmm1,%xmm0
 23.17 │       movups %xmm0,(%rcx,%rax,1)
 34.72 │       add    $0x10,%rax
 16.05 │       cmp    %r10,%rsi
       │     ↑ ja     27b

Additional speedups can be achieved by using AVX2 if the CPU supports it (e.g. Intel Haswell), though currently that needs to be done by compiling with OPT="-O3 -mavx2"; there is no runtime detection for this in numpy yet.
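As a quick sanity check from Python (a sketch: the array size and best-of-three repeat count are arbitrary), you can compare the int32 and float32 timings directly; a ratio near 1 indicates a vectorizing build like the git head above:

import timeit
import numpy as np

n = 10000
ai = np.random.randint(255, size=n).astype(np.int32)
bi = np.random.randint(255, size=n).astype(np.int32)
af = ai.astype(np.float32)
bf = bi.astype(np.float32)

# Best of three runs to damp noise; the ~5x ratio from the question
# indicates scalar integer loops, a ratio near 1 indicates SIMD.
t_int = min(timeit.repeat(lambda: ai + bi, number=10000, repeat=3))
t_flt = min(timeit.repeat(lambda: af + bf, number=10000, repeat=3))
print('int32 / float32 time ratio: %.2f' % (t_int / t_flt))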

Upvotes: 2

Roland Smith

Reputation: 43573

On a modern CPU there are a lot of factors that influence performance. Whether the data is integer or floating point is only one of them.

Factors such as whether the data is in the cache or has to be fetched from RAM (or even worse from swap) will have a big impact.
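For example (a sketch; the three sizes are arbitrary, chosen to sit below, around, and above typical cache sizes), timing the same addition at different array lengths and normalizing per element makes the memory hierarchy visible:

import timeit
import numpy as np

# Per-element cost of a + b for arrays that fit in cache versus
# arrays that have to stream from RAM.
for n in (10**3, 10**5, 10**7):
    a = np.random.random(n).astype(np.float32)
    b = np.random.random(n).astype(np.float32)
    reps = max(1, 10**7 // n)
    t = timeit.timeit(lambda: a + b, number=reps)
    print('n=%-8d %.3f ns per element' % (n, t / reps / n * 1e9))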

The compiler used to build numpy also has a big influence: how good is it at using SIMD instructions like SSE? These can speed up array operations significantly.

The results for my system (Intel Core2 Quad Q9300):

In [1]: from numpy.random import randint

In [2]: from numpy import int32, float32, float64

In [3]: a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)

In [4]: %timeit a+b
100000 loops, best of 3: 12.9 µs per loop

In [5]: a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)

In [6]: %timeit a+b
100000 loops, best of 3: 8.25 µs per loop

In [7]: a, b = randint(255,size=10000).astype(float64), randint(255,size=10000).astype(float64)

In [8]: %timeit a+b
100000 loops, best of 3: 13.9 µs per loop

So on this machine there is no factor of five between int32 and float32, nor a factor of two between float32 and float64.

From the processor utilization I can see that the timeit loops use only one of the four available cores. This seems to confirm that these simple operations don't go through BLAS routines, since this numpy was built with a parallel OpenBLAS.

The way numpy was compiled also has a significant influence. Using objdump as suggested in the answers to this question, I could see that my numpy uses SSE2 instructions and the xmm registers.
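That check can be scripted; here is a sketch, assuming binutils' objdump is on the PATH and that the ufunc inner loops live in numpy's umath extension module:

import subprocess
import numpy.core.umath as umath

# Disassemble the compiled ufunc loops and look for a packed SSE2
# integer add (paddd); its absence suggests scalar integer code.
asm = subprocess.check_output(['objdump', '-d', umath.__file__])
print('paddd present' if b'paddd' in asm else 'paddd absent')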

In [9]: from numpy import show_config

In [10]: show_config()
atlas_threads_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_lapack_info:
  NOT AVAILABLE
blas_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
mkl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
blas_mkl_info:
  NOT AVAILABLE

If you want to see the effect of the BLAS that you use, run the following program with numpy compiled with different BLAS libraries.

from __future__ import print_function
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print('FAST BLAS')
except ImportError:
    print('slow blas')

print("version:", numpy.__version__)
print("maxint:", sys.maxsize)
print()

setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print("dot:", t.timeit(count)/count, "sec")

On my machine I get:

FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.06626860399264842 sec

Based on the results from this test I switched from ATLAS to OpenBLAS, because it was significantly faster on my machine.

Upvotes: 0
