How can I make Numba access arrays as fast as Numpy can?

Question

Problem

Numpy is basically faster at copying array content to another array (or so it seems, for large-enough arrays).

I'm not expecting Numba to be faster but nearly as fast seems like a reasonable target.

Minimum working example

import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True)
def copyto_numba(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]

a = np.random.rand(2**20)
b = np.empty_like(a)
copyto_numpy(a, b)
copyto_numba(a, b)

%timeit copyto_numpy(a, b)
%timeit copyto_numba(a, b)

Timings

1.28 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.19 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Other things tried (and failed)

Dummy loop in the functions to "stay longer in no-python mode";
Capture N from the environment (i.e. no N = len(a));
Force use of flattened arrays: b.flat[i] = a.flat[i];
Use fastmath=True and/or nogil=True in case it triggers vectorization;
Change the type of the loop counter to unsigned integer (32 bits and 64 bits);
Remove the loop and copy the full array at once: b[:] = a (twice slower!);

New version with Cython (as fast as Numba, slower than Numpy, slight variation from @MSeifert's answer):

%load_ext cython

%%cython -c=-march=native -c=-O3 -c=-ftree-vectorize -c=-flto -c=-fuse-linker-plugin
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_cython_flags(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx, N = len(in_array)
    for idx in range(N):
        out_array[idx] = in_array[idx]

Checked CPU usage to confirm Numpy was only using one core.

Again, I'm not expecting Numba to be faster but certainly not 70% slower! Is there anything I can do to make this faster? Note that np.copyto is not implemented in nopython mode so it gets very slow with small vectors.

MSeifert · Accepted Answer

I'm not expecting Numba to be faster but certainly not 70% slower!

In my experience that's pretty much always the case, except you're willing to sacrifice accuracy (fastmath - not relevant here because we're not doing any math) or you can utilize multi-threading (in this case probably not worth it because copy is basically memory-bandwidth limited) or multi-processing (as the other answer show this can make it faster for larger arrays). After all it's comparing hand-written (often highly-optimized) code with auto-generated code. That should also answer the question title:

How can I make Numba access arrays as fast as Numpy can?

It's unlikely that you can with numba! Often it's 50-200% slower on big arrays if you try to re-implement some native-NumPy functionality with numba and that's already quite close.

Numba excels when you need to write code operating on arrays that is not already implemented in NumPy, SciPy, or any other optimized library.

However numba is already faster than similar code with Cython (which I find amazing):

%load_ext cython

%%cython
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_cython(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    for idx in range(len(in_array)):
        out_array[idx] = in_array[idx]

import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True)
def copyto_numba(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]

I'm using my own library simple_benchmark for the performance measurements here:

from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()

b.add_functions([copyto_cython, copyto_numpy, copyto_numba])

@b.add_arguments('array size')
def argument_provider():
    for exp in range(4, 21):
        size = 2**exp
        arr = np.random.rand(size)
        arr2 = np.empty(size)
        yield size, MultiArgument([arr, arr2])

r = b.run()
r.plot()

How can I make Numba access arrays as fast as Numpy can?

Problem

Minimum working example

Timings

Other things tried (and failed)

Answers (2)

There may be an impact of switching to a parallelized version

Related Questions