Ξένη Γήινος
Ξένη Γήινος

Reputation: 3072

How to find extrema per cell in 3 dimensional array with Numba?

I have recently written a script to convert BGR arrays of [0, 1] floats to HSL and back. I posted it on Code Review. There is currently one answer but it doesn't improve performance.

I have benchmarked my code against cv2.cvtColor and found my code to be inefficient, so I want to compile the code with Numba to make it run faster.

I have tried to wrapping every function with @nb.njit(cache=True, fastmath=True), and this doesn't work.

So I have tested every NumPy syntax and NumPy functions I have used individually, and found two functions that don't work with Numba.

I need to find the maximum channel of each pixel (np.max(img, axis=-1)) and minimum channel of each pixel (np.max(img, axis=-1)), and the axis argument doesn't work with Numba.

I have tried to Google search this but the only thing even remotely relevant I found is this, but it only implements np.any and np.all, and only works for two dimensional arrays whereas here the arrays are three-dimensional.

I can write a for loop based solution but I won't write it, because it is bound to be inefficient and against the purpose of using NumPy and Numba in the first place.

Minimal reproducible example:

import numba as nb
import numpy as np

@nb.njit(cache=True, fastmath=True)
def max_per_cell(arr):
    return np.max(arr, axis=-1)

@nb.njit(cache=True, fastmath=True)
def min_per_cell(arr):
    return np.min(arr, axis=-1)

img = np.random.random((3, 4, 3))
max_per_cell(img)
min_per_cell(img)

Exception:

In [2]: max_per_cell(img)
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
Cell In[2], line 1
----> 1 max_per_cell(img)

File C:\Python310\lib\site-packages\numba\core\dispatcher.py:468, in _DispatcherBase._compile_for_args(self, *args, **kws)
    464         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    465                f"by the following argument(s):\n{args_str}\n")
    466         e.patch_message(msg)
--> 468     error_rewrite(e, 'typing')
    469 except errors.UnsupportedError as e:
    470     # Something unsupported is present in the user code, add help info
    471     error_rewrite(e, 'unsupported_error')

File C:\Python310\lib\site-packages\numba\core\dispatcher.py:409, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    407     raise e
    408 else:
--> 409     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function amax at 0x0000014E306D3370>) found for signature:

 >>> amax(array(float64, 3d, C), axis=Literal[int](-1))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'npy_max': File: numba\np\arraymath.py: Line 541.
    With argument(s): '(array(float64, 3d, C), axis=int64)':
   Rejected as the implementation raised a specific error:
     TypingError: got an unexpected keyword argument 'axis'
  raised from C:\Python310\lib\site-packages\numba\core\typing\templates.py:784

During: resolving callee type: Function(<function amax at 0x0000014E306D3370>)
During: typing of call at <ipython-input-1-b3894b8b12b8> (10)


File "<ipython-input-1-b3894b8b12b8>", line 10:
def max_per_cell(arr):
    return np.max(arr, axis=-1)
    ^

How to fix this?

Upvotes: 0

Views: 266

Answers (2)

J&#233;r&#244;me Richard
J&#233;r&#244;me Richard

Reputation: 50816

CHW implementation

Following the comments of the answer of @NickODell, here is a faster SIMD-friendly solution when the image use a CWH layout (as requested by @NickODell):

import numpy as np
import numba as nb

@nb.vectorize('(float32, float32, float32)')
def vec_max(a, b, c):
    return max(a, b, c)

@nb.njit('(float32[:,:,::1], float32[:,::1])')
def max_per_cell_chw_nb_faster(a, b):
    vec_max(a[0], a[1], a[2], b)

# Benchmark

img1 = np.random.rand(1920, 1080, 3).astype(np.float32)
img2 = np.random.rand(3, 1920, 1080).astype(np.float32)
out = np.empty((1920, 1080), dtype=np.float32)

%timeit out = max_per_cell_nb(img1)
%timeit max_per_cell_chw_nb_faster(img2, out)

Here is the result on my machine (i5-9600KF + 40 GiB/s RAM on Windows):

3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.08 ms ± 6.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This means this implementation is 3 times faster on 32-bit floats. The former version is scalar, while the later use SIMD instructions. The former is also slowed down by page-faults while the later does not. The SIMD version could be even faster if it would not be memory-bound. The smaller the data-type the faster the SIMD implementation. With 8-bit unsigned integer, the SIMD version is about 9 times faster and still memory-bound:

1.83 ms ± 9.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
210 µs ± 3.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

HWC implementation

Note that the version of @NickODell can be generated by simply replacing the one-line content of max_per_cell_nb_faster by the following so to support HWC images:

@nb.njit('(float32[:,:,::1], float32[:,::1])')
def max_per_cell_hwc_nb_faster(a, b):
    vec_max(a[:,:,0], a[:,:,1], a[:,:,2], b)

However, this version is a bit slower than the solution of @NickODell (despite simpler) since it does not uses SIMD instructions internally:

3.35 ms ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact, AFAIK, the LLVM optimizer is not yet able to vectorize this access-pattern (because this is pretty hard, and it is generally less efficient than using CWH even when done manually and carefully).

Upvotes: 2

Nick ODell
Nick ODell

Reputation: 25384

It's reasonably straightforward to implement this without np.max(), using loops instead:

@nb.njit()
def max_per_cell_nb(arr):
    ret = np.empty(arr.shape[:-1], dtype=arr.dtype)
    n, m = ret.shape
    for i in range(n):
        for j in range(m):
            max_ = arr[i, j, 0]
            max_ = max(max_, arr[i, j, 1])
            max_ = max(max_, arr[i, j, 2])
            ret[i, j] = max_
    return ret

Benchmarking it, it turns out to be about 16x faster than np.max(arr, axis=-1).

%timeit max_per_cell_nb(img)
4.88 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit max_per_cell(img)
81 ms ± 654 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

While benchmarking this, I made the following assumptions:

  • The image is 1920x1080x3. (In other words, it's a big image.)
  • The image array is in C order rather than Fortran order. If it's in Fortran order, the speed of my method drops to 7ms, and the speed of np.max() gets faster and only takes 15 ms. See Check if numpy array is contiguous? for how to tell if your array is in C or Fortran order. Your example of np.random.random((3, 4, 3)) is C contiguous.
  • I'm comparing this function to np.max(arr, axis=-1) with Numba JIT turned off, because it can't really optimize single calls to NumPy functions.

Upvotes: 4

Related Questions