Reputation: 93
Numpy is basically faster at copying array content to another array (or so it seems, for large-enough arrays).
I'm not expecting Numba to be faster but nearly as fast seems like a reasonable target.
import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True)
def copyto_numba(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]
a = np.random.rand(2**20)
b = np.empty_like(a)
copyto_numpy(a, b)
copyto_numba(a, b)
%timeit copyto_numpy(a, b)
%timeit copyto_numba(a, b)
1.28 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.19 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Things I've tried:

- taking N from the environment (i.e. no N = len(a))
- b.flat[i] = a.flat[i]
- fastmath=True and/or nogil=True, in case it triggers vectorization (a rough sketch of these variants follows the Cython code below)
- b[:] = a (twice slower!)
- a new version with Cython (as fast as Numba, slower than Numpy; slight variation from @MSeifert's answer):
%load_ext cython

%%cython -c=-march=native -c=-O3 -c=-ftree-vectorize -c=-flto -c=-fuse-linker-plugin
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_cython_flags(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx, N = len(in_array)
    for idx in range(N):
        out_array[idx] = in_array[idx]
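For reference, the fastmath/nogil and slice-assignment variants mentioned above looked roughly like this (function names are just for illustration):

@nb.jit(nopython=True, fastmath=True, nogil=True)
def copyto_numba_flags(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]

@nb.jit(nopython=True)
def copyto_numba_slice(a, b):
    b[:] = a  # about twice as slow as the explicit loop for me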
I checked CPU usage to confirm that Numpy was only using one core.
Again, I'm not expecting Numba to be faster, but certainly not 70% slower! Is there anything I can do to make this faster? Note that np.copyto is not implemented in nopython mode, so it gets very slow with small vectors.
Upvotes: 4
Views: 1578
Reputation: 6482
It looks like Numba is a tiny bit faster on small arrays and on par on medium-sized arrays. For larger arrays, Numpy appears to switch to a parallelized copy, which has to be implemented manually in Numba. I don't know why the Cython performance is significantly slower, but this may be due to compiler flags.
Code
%load_ext cython

%%cython -c=-march=native -c=-O3 -c=-ftree-vectorize -c=-flto -c=-fuse-linker-plugin
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    for idx in range(len(in_array)):
        out_array[idx] = in_array[idx]
import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True, parallel=True)
def copyto_numba_p(a, b):
    N = len(a)
    for i in nb.prange(N):
        b[i] = a[i]

@nb.jit(nopython=True)
def copyto_numba_s(a, b):
    N = len(a)
    for i in nb.prange(N):  # without parallel=True, prange behaves like range
        b[i] = a[i]

@nb.jit(nopython=True)
def copyto_numba_combined(a, b):
    # Switch to the parallel version only for larger arrays,
    # where the threading overhead pays off.
    if a.shape[0] > 4 * 10**4:
        copyto_numba_p(a, b)
    else:
        copyto_numba_s(a, b)
from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()
b.add_functions([copyto, copyto_numpy, copyto_numba_combined, copyto_numba_s, copyto_numba_p])

@b.add_arguments('array size')
def argument_provider():
    for exp in range(4, 23):
        size = 2**exp
        arr = np.random.rand(size)
        arr2 = np.empty(size)
        yield size, MultiArgument([arr, arr2])

r = b.run()
r.plot()
Windows, Desktop
Here it looks like Numpy isn't switching to parallel copying above a certain threshold (this may also be a matter of configuration). With Numba I implemented the switching manually.
Linux, Workstation
Update: parallel Cython version
%%cython -c=-march=native -c=-O3 -c=-ftree-vectorize -c=-flto -c=-fuse-linker-plugin -c=-fopenmp
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_p(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    cdef Py_ssize_t size = len(in_array)
    for idx in prange(size, nogil=True):
        out_array[idx] = in_array[idx]
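To get a quick comparison without rerunning the full benchmark, something like this should work (array size picked arbitrarily for illustration):

import numpy as np

a = np.random.rand(2**22)
b = np.empty_like(a)

%timeit copyto_p(a, b)         # parallel Cython copy
%timeit np.copyto(b, a, 'no')  # NumPy copy of the same data, for reference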
Upvotes: 2
Reputation: 152637
I'm not expecting Numba to be faster but certainly not 70% slower!
In my experience that's pretty much always the case, unless you're willing to sacrifice accuracy (fastmath, not relevant here because we're not doing any math), or you can utilize multi-threading (in this case probably not worth it because a copy is basically memory-bandwidth limited) or multi-processing (as the other answer shows, this can make it faster for larger arrays). After all, it's comparing hand-written (often highly optimized) code with auto-generated code. That should also answer the question title:
How can I make Numba access arrays as fast as Numpy can?
It's unlikely that you can with numba! Often it's 50-200% slower on big arrays if you try to re-implement some native-NumPy functionality with numba, and that's already quite close.
Numba excels when you need to write code operating on arrays that is not already implemented in NumPy, SciPy, or any other optimized library.
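For example, a thresholded sum of squared differences (a minimal, illustrative sketch; not part of the copy benchmark above) needs several temporary arrays in NumPy but becomes a single jitted loop in numba:

import numpy as np
import numba as nb

@nb.jit(nopython=True)
def sum_sq_diff_above(a, b, threshold):
    # One pass over the data, no temporary arrays allocated.
    total = 0.0
    for i in range(len(a)):
        d = a[i] - b[i]
        if d > threshold:
            total += d * d
    return total

# Rough NumPy equivalent, which allocates temporaries for (a - b), the mask, etc.:
# d = a - b
# total = (d[d > threshold] ** 2).sum()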
However, numba is already faster than similar code written with Cython (which I find amazing):
%load_ext cython

%%cython
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_cython(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    for idx in range(len(in_array)):
        out_array[idx] = in_array[idx]
import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True)
def copyto_numba(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]
I'm using my own library simple_benchmark for the performance measurements here:
from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()
b.add_functions([copyto_cython, copyto_numpy, copyto_numba])

@b.add_arguments('array size')
def argument_provider():
    for exp in range(4, 21):
        size = 2**exp
        arr = np.random.rand(size)
        arr2 = np.empty(size)
        yield size, MultiArgument([arr, arr2])

r = b.run()
r.plot()
Upvotes: 2