Reputation: 2989
Consider the following IPython perf test, where we create a pair of 10,000-element 32-bit vectors and add them, first using integer arithmetic and then using float arithmetic:
from numpy.random import randint
from numpy import int32, float32
a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)
%timeit a+b # int32 addition, gives 20.6µs per loop
a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)
%timeit a+b # float32 addition, gives 3.91µs per loop
Why is the floating point version about 5x faster?
If you do the same test with float64 it takes twice as long as float32, which is what you'd expect if we are fully utilizing the hardware. However, the timing for the integer case seems to be constant from int8 to int64. This, together with the 5x slowdown, makes me suspect that it is completely failing to use SSE.
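For completeness, a minimal standalone sketch of the dtype sweep described above (using the timeit module instead of %timeit; the repeat count is an arbitrary choice, not the exact setup from the runs quoted here):
from __future__ import print_function
import timeit
import numpy as np

# Sweep the dtypes mentioned above and report a rough per-add time for each.
for dt in [np.int8, np.int16, np.int32, np.int64, np.float32, np.float64]:
    a = np.random.randint(255, size=10000).astype(dt)
    b = np.random.randint(255, size=10000).astype(dt)
    t = timeit.timeit(lambda: a + b, number=1000)
    print(dt.__name__, t / 1000, "sec per add")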
For int32, I observe similar 20µs values when a+b is replaced by a & 0xff or a >> 2, suggesting that the problem is not limited to addition.
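The same kind of check for the bitwise operations, as a sketch (again with timeit rather than %timeit; the repeat count is arbitrary):
from __future__ import print_function
import timeit
import numpy as np

a = np.random.randint(255, size=10000).astype(np.int32)
b = np.random.randint(255, size=10000).astype(np.int32)

# Time the addition and the two bitwise operations from the question.
for label, op in [("a + b", lambda: a + b),
                  ("a & 0xff", lambda: a & 0xff),
                  ("a >> 2", lambda: a >> 2)]:
    t = timeit.timeit(op, number=1000)
    print(label, t / 1000, "sec per op")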
I'm using numpy 1.9.1, though unfortunately I can't remember whether I compiled it locally or downloaded a binary. Either way, this performance observation was pretty shocking to me. How is it possible that the version I have is so hopeless at integer arithmetic?
Edit: I've also tested on a similar but separate PC, running numpy 1.8, which I'm fairly sure was straight from a PythonXY binary. I got the same results.
Question: Do other people see similar results? If not, what can I do to be like them?
Update: I have created a new issue on numpy's github repo.
Upvotes: 1
Views: 866
Reputation: 2434
The not-yet-released numpy 1.10 will also vectorize integer operations, if the compiler supports it. This was added in this change: https://github.com/numpy/numpy/pull/5144
E.g. your testcase with the current git head compiled with gcc 4.8 runs at the same speed for int and float, and the code produced looks decent:
0.04 │27b: movdqu (%rdx,%rax,1),%xmm0
25.33 │ add $0x1,%r10
│ movdqu (%r8,%rax,1),%xmm1
│ paddd %xmm1,%xmm0
23.17 │ movups %xmm0,(%rcx,%rax,1)
34.72 │ add $0x10,%rax
16.05 │ cmp %r10,%rsi
│ ↑ ja 27b
Additional speedups can be achieved by using AVX2 if the CPU supports it (e.g. Intel Haswell), though currently that needs to be done by compiling with OPT="-O3 -mavx2"; there is no runtime detection for this in numpy yet.
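If you want to check whether your CPU even advertises AVX2 before rebuilding, a rough Linux-only sketch (it simply reads /proc/cpuinfo, which does not exist on other platforms):
from __future__ import print_function

# Linux-only sketch: look for the avx2 flag in /proc/cpuinfo before
# attempting an OPT="-O3 -mavx2" rebuild of numpy.
with open("/proc/cpuinfo") as f:
    flags = [line.split() for line in f if line.startswith("flags")]
print("avx2 supported:", bool(flags) and "avx2" in flags[0])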
Upvotes: 2
Reputation: 43573
On a modern CPU there are a lot of factors that influence performance. Whether the data is integer or floating point is only one of them.
Factors such as whether the data is in the cache or has to be fetched from RAM (or even worse from swap) will have a big impact.
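As a rough illustration of that effect, the per-element cost of the same addition tends to rise once the arrays outgrow the CPU caches (the sizes and repeat count below are arbitrary choices, not measurements from my machine):
from __future__ import print_function
import timeit
import numpy as np

# Time a float32 addition at several array sizes; once the arrays no longer
# fit in cache, the time per element typically goes up.
for n in (10000, 100000, 1000000, 10000000):
    a = np.random.random(n).astype(np.float32)
    b = np.random.random(n).astype(np.float32)
    t = timeit.timeit(lambda: a + b, number=100)
    print(n, 1e9 * t / 100 / n, "ns per element")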
The compiler that was used to build numpy will also have a big influence: how good is it at using SIMD instructions like SSE? Those can speed up array operations significantly.
The results for my system (Intel Core2 Quad Q9300):
In [1]: from numpy.random import randint
In [2]: from numpy import int32, float32, float64
In [3]: a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)
In [4]: %timeit a+b
100000 loops, best of 3: 12.9 µs per loop
In [5]: a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)
In [6]: %timeit a+b
100000 loops, best of 3: 8.25 µs per loop
In [7]: a, b = randint(255,size=10000).astype(float64), randint(255,size=10000).astype(float64)
In [8]: %timeit a+b
100000 loops, best of 3: 13.9 µs per loop
So on this machine there is no factor of five difference between int32 and float32, and neither is there a factor of two between float32 and float64.
From processor utilization I can see that the timeit loops only use one of the four available cores. This seems to confirm that these simple operations don't use BLAS routines, since this numpy was built with a parallel OpenBLAS.
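One way to double-check that, as a sketch: force OpenBLAS to a single thread (OPENBLAS_NUM_THREADS has to be set before numpy is imported) and see whether only the dot product slows down while the elementwise addition stays the same:
from __future__ import print_function
import os
# Must be set before numpy is imported; OpenBLAS reads it at load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import timeit
import numpy as np

x = np.random.random((1000, 1000))
a = np.random.random(10000).astype(np.float32)
b = np.random.random(10000).astype(np.float32)

# The dot product goes through BLAS; the elementwise add does not.
print("dot:", timeit.timeit(lambda: np.dot(x, x.T), number=5) / 5, "sec")
print("add:", timeit.timeit(lambda: a + b, number=1000) / 1000, "sec")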
The way numpy was compiled will also have a significant influence. Based on the answers to this question, I could see using objdump that my numpy uses SSE2 instructions and the xmm registers.
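For reference, a sketch of how that check can be scripted, assuming objdump is installed and that the relevant loops live in numpy's core.multiarray extension module (the module layout differs between numpy versions):
import subprocess
import numpy.core.multiarray as multiarray

# Disassemble numpy's compiled core and count lines that touch xmm registers;
# a large count suggests SSE/SSE2 instructions are actually being emitted.
asm = subprocess.check_output(["objdump", "-d", multiarray.__file__])
print(sum(1 for line in asm.splitlines() if b"%xmm" in line))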
In [9]: from numpy import show_config
In [10]: show_config()
atlas_threads_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_lapack_info:
  NOT AVAILABLE
blas_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
mkl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
blas_mkl_info:
  NOT AVAILABLE
If you want to see the effect of the BLAS that you use, run the following program with numpy compiled with different BLAS libraries.
from __future__ import print_function
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print('FAST BLAS')
except ImportError:
    print('slow blas')

print("version:", numpy.__version__)
print("maxint:", sys.maxsize)
print()

setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print("dot:", t.timeit(count)/count, "sec")
On my machine I get:
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807
dot: 0.06626860399264842 sec
Based on the results from this test I switched from ATLAS to OpenBLAS, because it was significantly faster on my machine.
Upvotes: 0