Reputation: 3072
I have recently written a script to convert BGR arrays of [0, 1] floats to HSL and back. I posted it on Code Review. There is currently one answer but it doesn't improve performance.
I have benchmarked my code against cv2.cvtColor
and found my code to be inefficient, so I want to compile the code with Numba to make it run faster.
I have tried to wrapping every function with @nb.njit(cache=True, fastmath=True)
, and this doesn't work.
So I have tested every NumPy syntax and NumPy functions I have used individually, and found two functions that don't work with Numba.
I need to find the maximum channel of each pixel (np.max(img, axis=-1)
) and minimum channel of each pixel (np.max(img, axis=-1)
), and the axis
argument doesn't work with Numba.
I have tried to Google search this but the only thing even remotely relevant I found is this, but it only implements np.any
and np.all
, and only works for two dimensional arrays whereas here the arrays are three-dimensional.
I can write a for loop based solution but I won't write it, because it is bound to be inefficient and against the purpose of using NumPy and Numba in the first place.
Minimal reproducible example:
import numba as nb
import numpy as np
@nb.njit(cache=True, fastmath=True)
def max_per_cell(arr):
return np.max(arr, axis=-1)
@nb.njit(cache=True, fastmath=True)
def min_per_cell(arr):
return np.min(arr, axis=-1)
img = np.random.random((3, 4, 3))
max_per_cell(img)
min_per_cell(img)
Exception:
In [2]: max_per_cell(img)
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
Cell In[2], line 1
----> 1 max_per_cell(img)
File C:\Python310\lib\site-packages\numba\core\dispatcher.py:468, in _DispatcherBase._compile_for_args(self, *args, **kws)
464 msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
465 f"by the following argument(s):\n{args_str}\n")
466 e.patch_message(msg)
--> 468 error_rewrite(e, 'typing')
469 except errors.UnsupportedError as e:
470 # Something unsupported is present in the user code, add help info
471 error_rewrite(e, 'unsupported_error')
File C:\Python310\lib\site-packages\numba\core\dispatcher.py:409, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
407 raise e
408 else:
--> 409 raise e.with_traceback(None)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function amax at 0x0000014E306D3370>) found for signature:
>>> amax(array(float64, 3d, C), axis=Literal[int](-1))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'npy_max': File: numba\np\arraymath.py: Line 541.
With argument(s): '(array(float64, 3d, C), axis=int64)':
Rejected as the implementation raised a specific error:
TypingError: got an unexpected keyword argument 'axis'
raised from C:\Python310\lib\site-packages\numba\core\typing\templates.py:784
During: resolving callee type: Function(<function amax at 0x0000014E306D3370>)
During: typing of call at <ipython-input-1-b3894b8b12b8> (10)
File "<ipython-input-1-b3894b8b12b8>", line 10:
def max_per_cell(arr):
return np.max(arr, axis=-1)
^
How to fix this?
Upvotes: 0
Views: 266
Reputation: 50816
Following the comments of the answer of @NickODell, here is a faster SIMD-friendly solution when the image use a CWH layout (as requested by @NickODell):
import numpy as np
import numba as nb
@nb.vectorize('(float32, float32, float32)')
def vec_max(a, b, c):
return max(a, b, c)
@nb.njit('(float32[:,:,::1], float32[:,::1])')
def max_per_cell_chw_nb_faster(a, b):
vec_max(a[0], a[1], a[2], b)
# Benchmark
img1 = np.random.rand(1920, 1080, 3).astype(np.float32)
img2 = np.random.rand(3, 1920, 1080).astype(np.float32)
out = np.empty((1920, 1080), dtype=np.float32)
%timeit out = max_per_cell_nb(img1)
%timeit max_per_cell_chw_nb_faster(img2, out)
Here is the result on my machine (i5-9600KF + 40 GiB/s RAM on Windows):
3 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.08 ms ± 6.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This means this implementation is 3 times faster on 32-bit floats. The former version is scalar, while the later use SIMD instructions. The former is also slowed down by page-faults while the later does not. The SIMD version could be even faster if it would not be memory-bound. The smaller the data-type the faster the SIMD implementation. With 8-bit unsigned integer, the SIMD version is about 9 times faster and still memory-bound:
1.83 ms ± 9.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
210 µs ± 3.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that the version of @NickODell can be generated by simply replacing the one-line content of max_per_cell_nb_faster
by the following so to support HWC images:
@nb.njit('(float32[:,:,::1], float32[:,::1])')
def max_per_cell_hwc_nb_faster(a, b):
vec_max(a[:,:,0], a[:,:,1], a[:,:,2], b)
However, this version is a bit slower than the solution of @NickODell (despite simpler) since it does not uses SIMD instructions internally:
3.35 ms ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In fact, AFAIK, the LLVM optimizer is not yet able to vectorize this access-pattern (because this is pretty hard, and it is generally less efficient than using CWH even when done manually and carefully).
Upvotes: 2
Reputation: 25384
It's reasonably straightforward to implement this without np.max()
, using loops instead:
@nb.njit()
def max_per_cell_nb(arr):
ret = np.empty(arr.shape[:-1], dtype=arr.dtype)
n, m = ret.shape
for i in range(n):
for j in range(m):
max_ = arr[i, j, 0]
max_ = max(max_, arr[i, j, 1])
max_ = max(max_, arr[i, j, 2])
ret[i, j] = max_
return ret
Benchmarking it, it turns out to be about 16x faster than np.max(arr, axis=-1)
.
%timeit max_per_cell_nb(img)
4.88 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit max_per_cell(img)
81 ms ± 654 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
While benchmarking this, I made the following assumptions:
np.max()
gets faster and only takes 15 ms. See Check if numpy array is contiguous? for how to tell if your array is in C or Fortran order. Your example of np.random.random((3, 4, 3))
is C contiguous.np.max(arr, axis=-1)
with Numba JIT turned off, because it can't really optimize single calls to NumPy functions.Upvotes: 4