Reputation: 595
I am trying to figure out how to use numba to generate numpy-style ufuncs for vectorized array operations. I noticed very slow performance, so I tried to debug by adding the following to my code, as per the numba FAQ:
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
Apparently my loop is not being vectorized 'due to memory conflicts.'
Again from the FAQ page, I see that this occurs 'when memory access pattern is non-trivial.' I'm not clear on what this means, but the code I am trying to vectorize seems pretty trivial to me:
from numba import guvectorize

@guvectorize(['void(f4[:,:], b1[:,:], f8, f4, f4[:,:])'],
             '(n,m), (n,m), (), () -> (n,m)', cache=True)
def enforce_cutoff(img, mask, max, nodata, out):
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            if mask[i,j]:
                out[i,j] = nodata
            else:
                if img[i,j]<max:
                    out[i,j] = img[i,j]
                else:
                    out[i,j] = max-0.1
Any clues on why this cannot be vectorized and how I could get around it would be very much appreciated. I am pretty new to numba & have no familiarity at all with LLVM so I don't understand much of this well.
The full output from LLVM is here:
LV: Checking a loop in "_ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from enforce_cutoff
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B40.us
LV: Found an induction variable.
LV: Found an induction variable.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "__gufunc__._ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from <numba.npyufunc.wrappers._GufuncWrapper object at 0x0000020A848A6438>
LV: Loop hints: force=? width=0 unroll=0
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "_ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from <numba.npyufunc.wrappers._GufuncWrapper object at 0x0000020A848A6438>
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B40.us
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Did not find one integer induction var.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "_ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from <numba.npyufunc.wrappers._GufuncWrapper object at 0x0000020A848A6438>
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B20.us.us
LV: Found an induction variable.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing: Cannot prove legality.
Upvotes: 1
Views: 270
Reputation: 6482
A possible workaround is to ensure that the arrays are C-contiguous. In the example below, np.ascontiguousarray leaves an array as it is if it is already C-contiguous and copies it otherwise.
Example
import numba as nb
import numpy as np

@nb.njit(cache=True,parallel=True)
def enforce_cutoff_2(img, mask, max, nodata, out):
    #create a contiguous copy if the array isn't C-contiguous
    img=np.ascontiguousarray(img)
    mask=np.ascontiguousarray(mask)
    for i in nb.prange(img.shape[0]):
        for j in range(img.shape[1]):
            if mask[i,j]:
                out[i,j] = nodata
            else:
                if img[i,j]<max:
                    out[i,j] = img[i,j]
                else:
                    out[i,j] = max-0.1
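To see what the contiguity check above amounts to, here is a minimal NumPy-only sketch. The array names (base, view, copy) are made up for illustration; the slicing pattern is the same one used for the non-contiguous timings.

```python
import numpy as np

# Small C-contiguous array (hypothetical example data).
base = np.arange(16, dtype=np.float32).reshape(4, 4)

# Taking every second row and column gives a strided, non-contiguous view.
view = base[0:-1:2, 0:-1:2]
print(base.flags['C_CONTIGUOUS'], view.flags['C_CONTIGUOUS'])

# ascontiguousarray copies the strided view into a fresh contiguous buffer ...
copy = np.ascontiguousarray(view)
print(copy.flags['C_CONTIGUOUS'], np.shares_memory(view, copy))

# ... but does not copy data that is already C-contiguous.
print(np.shares_memory(base, np.ascontiguousarray(base)))
```

So the copy only costs something when the input is actually strided, which matches the difference between the two sets of timings.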
Timings
#contiguous arrays
img=np.random.rand(1000,1000).astype(np.float32)
mask=np.random.rand(1000,1000)>0.5
max=0.5
nodata=1.
out=np.empty((img.shape[0],img.shape[1]),dtype=np.float32)
%timeit enforce_cutoff_2(img, mask, max, nodata, out)
#single-thread
#678 µs ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#parallel
#143 µs ± 1.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#non-contiguous arrays
img=np.random.rand(2000,2000).astype(np.float32)
mask=np.random.rand(2000,2000)>0.5
img=img[0:-1:2,0:-1:2]
mask=mask[0:-1:2,0:-1:2]
max=0.5
nodata=1.
out=np.empty((img.shape[0],img.shape[1]),dtype=np.float32)
%timeit enforce_cutoff_2(img, mask, max, nodata, out)
#single threaded
#with contiguous copy
#1.78 ms ± 9.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#without contiguous copy
#5.76 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#parallel
#with contiguous copy
#1.42 ms ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#without contiguous copy
#1.08 ms ± 75.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
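If you want to check that a compiled version still gives the right answer, a plain NumPy reference is handy to compare against. This is only a sketch: enforce_cutoff_reference and the parameter name cutoff (used instead of max to avoid shadowing the builtin) are hypothetical names, and the final cast assumes a float32 output array as in the timings above.

```python
import numpy as np

def enforce_cutoff_reference(img, mask, cutoff, nodata):
    """NumPy reference for the loop: nodata where mask is set,
    otherwise img, replaced by cutoff - 0.1 where img >= cutoff."""
    capped = np.where(img < cutoff, img, cutoff - 0.1)
    # Cast to float32 to match the dtype of the out array used above.
    return np.where(mask, nodata, capped).astype(np.float32)

# Tiny hand-made inputs (hypothetical values).
img = np.array([[0.2, 0.9], [0.7, 0.1]], dtype=np.float32)
mask = np.array([[False, False], [True, False]])
expected = enforce_cutoff_reference(img, mask, 0.5, -1.0)
print(expected)
```

The result can then be compared with the out array filled by the jitted function, for example with np.allclose(out, expected).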
Upvotes: 1