user11488411

Reputation: 93

How can I make Numba access arrays as fast as Numpy can?

Problem

Numpy is basically faster at copying array content to another array (or so it seems, for large-enough arrays).

I'm not expecting Numba to be faster but nearly as fast seems like a reasonable target.

Minimum working example

import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')  # copies b into a; the third positional argument is casting='no'

@nb.jit(nopython=True)
def copyto_numba(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]

a = np.random.rand(2**20)
b = np.empty_like(a)
copyto_numpy(a, b)
copyto_numba(a, b)

%timeit copyto_numpy(a, b)
%timeit copyto_numba(a, b)

Timings

1.28 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.19 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Other things tried (and failed)

Again, I'm not expecting Numba to be faster, but it certainly shouldn't be 70% slower! Is there anything I can do to make this faster? Note that np.copyto is not implemented in nopython mode, so it gets very slow with small vectors.


Timing comparison of Numba, Cython and Numpy copy_to functions vs. array size.
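For completeness, one variant not included in the timings above is a whole-array slice assignment inside the jitted function; Numba supports this kind of basic slicing in nopython mode, so the following is only a minimal sketch, not a measured result:

import numpy as np
import numba as nb

@nb.jit(nopython=True)
def copyto_numba_slice(a, b):
    # whole-array slice assignment, supported in nopython mode
    b[:] = a

a = np.random.rand(2**20)
b = np.empty_like(a)
copyto_numba_slice(a, b)  # call once so compilation is excluded from any timing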

Upvotes: 4

Views: 1578

Answers (2)

max9111

Reputation: 6482

Switching to a parallelized version may have an impact

It looks like Numba is a tiny bit faster on small arrays and on par on medium-sized arrays. For larger arrays it looks like NumPy switches to a parallelized copy, which has to be implemented manually in Numba. I don't know why the Cython performance is significantly slower, but this may be due to some compiler flags.

Code

%load_ext cython

%%cython -c=-march=native -c=-O3 -c=-ftree-vectorize -c=-flto -c=-fuse-linker-plugin
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    for idx in range(len(in_array)):
        out_array[idx] = in_array[idx]

import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True, parallel=True)
def copyto_numba_p(a, b):
    N = len(a)
    for i in nb.prange(N):
        b[i] = a[i]

@nb.jit(nopython=True)
def copyto_numba_s(a, b):
    N = len(a)
    for i in range(N):  # serial version: plain range, no parallel=True
        b[i] = a[i]

@nb.jit(nopython=True)
def copyto_numba_combined(a, b):
    # switch to the parallel copy only above a size threshold
    if a.shape[0] > 4 * 10**4:
        copyto_numba_p(a, b)
    else:
        copyto_numba_s(a, b)

from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()

b.add_functions([copyto, copyto_numpy, copyto_numba_combined, copyto_numba_s, copyto_numba_p])

@b.add_arguments('array size')
def argument_provider():
    for exp in range(4, 23):
        size = 2**exp
        arr = np.random.rand(size)
        arr2 = np.empty(size)
        yield size, MultiArgument([arr, arr2])

r = b.run()
r.plot()

Windows, Desktop

Here it looks like NumPy isn't switching to parallel copying at a certain threshold (this may also be a matter of configuration). With Numba I implemented the switching manually.

Timing plot (Windows)

Linux, Workstation

Timing plot (Linux)

Update: parallel Cython version

%%cython -c=-march=native -c=-O3 -c=-ftree-vectorize -c=-flto -c=-fuse-linker-plugin -c=-fopenmp

cimport cython
from cython.parallel import prange
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_p(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    cdef Py_ssize_t size = len(in_array)
    for idx in prange(size, nogil=True):
        out_array[idx] = in_array[idx]

Timing plot (Linux)

Upvotes: 2

MSeifert

Reputation: 152637

I'm not expecting Numba to be faster but certainly not 70% slower!

In my experience that's pretty much always the case, unless you're willing to sacrifice accuracy (fastmath - not relevant here because we're not doing any math), or you can utilize multi-threading (in this case probably not worth it because a copy is basically memory-bandwidth limited) or multi-processing (as the other answer shows, this can make it faster for larger arrays). After all, it's comparing hand-written (often highly optimized) code with auto-generated code. That should also answer the question title:

How can I make Numba access arrays as fast as Numpy can?

It's unlikely that you can with Numba! If you try to re-implement native NumPy functionality with Numba, it's often 50-200% slower on big arrays, and that's already quite close.

Numba excels when you need to write code operating on arrays that is not already implemented in NumPy, SciPy, or any other optimized library.
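To illustrate that point with a hypothetical example (not from the question): a single fused loop computing what NumPy would express as np.abs(np.diff(np.clip(a, lo, hi))).sum(), which needs several temporary arrays there but only one pass in Numba:

import numpy as np
import numba as nb

@nb.jit(nopython=True)
def clipped_diff_sum(a, lo, hi):
    # one pass: clip each element, then accumulate absolute consecutive differences
    total = 0.0
    prev = min(max(a[0], lo), hi)
    for i in range(1, len(a)):
        cur = min(max(a[i], lo), hi)
        total += abs(cur - prev)
        prev = cur
    return total

a = np.random.rand(2**20)
# same result as np.abs(np.diff(np.clip(a, 0.1, 0.9))).sum(), up to summation order
print(clipped_diff_sum(a, 0.1, 0.9))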


However, numba is already faster than similar code with Cython (which I find amazing):

Timing comparison of Cython, NumPy and Numba copy functions vs. array size.

%load_ext cython

%%cython
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cpdef copyto_cython(double[::1] in_array, double[::1] out_array):
    cdef Py_ssize_t idx
    for idx in range(len(in_array)):
    out_array[idx] = in_array[idx]

import numpy as np
import numba as nb

def copyto_numpy(a, b):
    np.copyto(a, b, 'no')

@nb.jit(nopython=True)
def copyto_numba(a, b):
    N = len(a)
    for i in range(N):
        b[i] = a[i]

I'm using my own library simple_benchmark for the performance measurements here:

from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()

b.add_functions([copyto_cython, copyto_numpy, copyto_numba])

@b.add_arguments('array size')
def argument_provider():
    for exp in range(4, 21):
        size = 2**exp
        arr = np.random.rand(size)
        arr2 = np.empty(size)
        yield size, MultiArgument([arr, arr2])

r = b.run()
r.plot()

Upvotes: 2
