eirikg

Reputation: 47

Numba: Parallelization not working properly


I have written some code in Python and wanted to improve it using Numba's function decorators. Using the just-in-time compiler (`@jit`) works fine. However, when I tried to parallelize my code to speed it up even more, the program strangely runs slower than the non-parallelized version.

Here is my code:

@numba.njit(parallel=True)
def get_output(input, depth, input_size, kernel_size, input_depth, kernels, bias):
    output = np.zeros((depth, input_size[0] - kernel_size + 1, input_size[0] - kernel_size + 1))
    for k in numba.prange(depth):
        for i in numba.prange(input_depth):
            output[k] += valid_correlate(input[i], kernels[k][i])
        output[k] += bias[k]
    return output

@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    result = 0
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]), 
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    i_result = np.multiply(area, s_filter)
    result = np.sum(i_result)
    return result

@numba.njit(nogil=True)
def valid_correlate(mat, filter):
    f_mat = np.zeros((mat.shape[0] - filter.shape[0] + 1, mat.shape[1] - filter.shape[1] + 1))

    for x in range(f_mat.shape[0]):
        for y in range(f_mat.shape[1]):
            f_mat[x, y] = apply_filter(mat, filter, (x, y))
    return f_mat

With `parallel=True` it takes about 0.07 seconds, while it takes about 0.056 seconds without it.
I can't seem to figure out the problem here and would be glad for any help!
Regards
-Eirik

Upvotes: 1

Views: 1314

Answers (1)

Jérôme Richard

Reputation: 50308

As pointed out in the comments, creating temporary NumPy arrays is expensive because allocations do not scale in parallel, and also because memory-bound code does not scale. Here is a modified (untested) version of `apply_filter`:

@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]), 
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    res = 0.0
    # Accumulate the products directly instead of calling `np.multiply`,
    # which creates a temporary array
    for i in range(area.shape[0]):
        for j in range(area.shape[1]):
            res += area[i, j] * s_filter[i, j]
    return res

I assumed the shapes of `area` and `s_filter` are the same. If this is not the case, then please provide a complete reproducible example.

Also note that the first call is slow because of compilation time. One way to fix that is to provide the function signature so the function is compiled eagerly, at definition time. See the Numba documentation for more information about this.

By the way, the operation looks like a custom correlation/convolution. If so, then note that FFTs are known to speed this up a lot when the filter is relatively big. For small filters, the code can be faster if the compiler knows the size of the array at compile time. If the filter is relatively big and separable, then the whole computation can be split into much faster steps.
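As a sketch of the FFT route (using SciPy's `fftconvolve`, which is not part of the question's code): a "valid" cross-correlation is just a convolution with the kernel flipped in both axes, so:

```python
import numpy as np
from scipy.signal import fftconvolve

def valid_correlate_fft(mat, kernel):
    # fftconvolve performs convolution, which flips the kernel;
    # flipping it back yields cross-correlation.
    return fftconvolve(mat, kernel[::-1, ::-1], mode="valid")
```

This pays off once the filter gets large; for tiny filters the direct loop above is usually faster.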

Upvotes: 2
