Reputation: 47
I have written some code in Python and wanted to improve it using Numba's function decorators. Using a just-in-time compiler works fine (`@jit`). However, when I tried to parallelize my code to speed it up even more, the program strangely runs slower than the non-parallelized version.
Here is my code:
```python
import numpy as np
import numba


@numba.njit(parallel=True)
def get_output(input, depth, input_size, kernel_size, input_depth, kernels, bias):
    output = np.zeros((depth, input_size[0] - kernel_size + 1, input_size[0] - kernel_size + 1))
    for k in numba.prange(depth):
        for i in numba.prange(input_depth):
            output[k] += valid_correlate(input[i], kernels[k][i])
        output[k] += bias[k]
    return output


@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]),
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    i_result = np.multiply(area, s_filter)
    result = np.sum(i_result)
    return result


@numba.njit(nogil=True)
def valid_correlate(mat, filter):
    f_mat = np.zeros((mat.shape[0] - filter.shape[0] + 1, mat.shape[1] - filter.shape[0] + 1))
    for x in range(f_mat.shape[0]):
        for y in range(f_mat.shape[1]):
            f_mat[x, y] = apply_filter(mat, filter, (x, y))
    return f_mat
```
With `parallel=True` it takes about 0.07 seconds, whereas it takes about 0.056 seconds without it.
I can't seem to figure out the problem here and would be glad for any help!
Regards
-Eirik
Upvotes: 1
Views: 1314
Reputation: 50308
As pointed out in the comments, creating temporary NumPy arrays is expensive: allocations do not scale in parallel, and memory-bound code does not scale well either. Here is a modified (untested) version:
```python
@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]),
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    res = 0.0
    # Replace the slow `np.multiply` call creating a temporary array
    # with an explicit loop that accumulates the products in place.
    for i in range(area.shape[0]):
        for j in range(area.shape[1]):
            res += area[i, j] * s_filter[i, j]
    return res
```
I assumed the shapes of `area` and `s_filter` are the same. If this is not the case, then please provide a complete reproducible example.
Also note that the first call is slow because of compilation time. One way to fix that is to provide the function signature so that the function is compiled eagerly. See the Numba documentation for more information about this.
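For illustration, here is a minimal sketch of eager compilation with an explicit signature; `apply_filter_sig` is a hypothetical, simplified variant (it skips the boundary clipping of the original) used only to show the signature syntax:

```python
from numba import njit, float64, int64
from numba.types import UniTuple

# With an explicit signature, Numba compiles at decoration time,
# so the first real call does not pay the compilation cost.
@njit(float64(float64[:, :], float64[:, :], UniTuple(int64, 2)), fastmath=True)
def apply_filter_sig(mat, filt, point):
    res = 0.0
    for i in range(filt.shape[0]):
        for j in range(filt.shape[1]):
            res += mat[point[0] + i, point[1] + j] * filt[i, j]
    return res
```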
By the way, the operation looks like a custom convolution. If so, note that FFTs are known to speed this up a lot when the filter is relatively big. For a small filter, the code can be faster if the compiler knows the size of the array at compile time. If the filter is relatively big and separable, the whole computation can be split into much faster steps.
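As a sketch of the FFT route (assuming SciPy is available, which the question does not use), a valid cross-correlation can be written as an FFT-based convolution with the kernel flipped along both axes; `valid_correlate_fft` is a hypothetical name:

```python
from scipy.signal import fftconvolve

# FFT-based "valid" cross-correlation: convolving with the kernel flipped
# along both axes is equivalent to correlating with the original kernel.
# This usually beats explicit loops once the kernel is reasonably large.
def valid_correlate_fft(mat, kernel):
    return fftconvolve(mat, kernel[::-1, ::-1], mode='valid')
```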
Upvotes: 2