Reputation: 934
I'm trying to apply parallelism to the following algorithm. This should be easily parallelizable, as the calculations are independent for the first three dimensions (b, i, j).
def nb_forward(np.ndarray[FLOAT64, ndim=4] inputv, np.ndarray[FLOAT64, ndim=4] kernels, np.ndarray[FLOAT64, ndim=1] bias, tuple stride):
    cdef unsigned int kernel_size0 = kernels.shape[1], kernel_size1 = kernels.shape[2], \
        stride0 = stride[0], stride1 = stride[1], \
        num_dim = kernels.shape[0], \
        num_filters = kernels.shape[3], \
        batch_size = inputv.shape[0]
    cdef unsigned int out_size0 = (inputv.shape[1] - kernel_size0) / stride0 + 1, \
        out_size1 = (inputv.shape[2] - kernel_size1) / stride1 + 1
    cdef double[:, :, :, :] out = np.empty(shape=[batch_size, out_size0, out_size1, num_filters], dtype=np.float64)

    cdef unsigned int b, i, j, m, kw, kh, n
    cdef unsigned int iin, jin
    cdef double acc

    with nogil, parallel():
        for b in prange(batch_size):
            for i in range(out_size0):
                for j in range(out_size1):
                    iin = i * stride0
                    jin = j * stride1
                    for n in range(num_filters):
                        acc = 0.
                        for kw in range(kernel_size0):
                            for kh in range(kernel_size1):
                                for m in range(num_dim):
                                    acc += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
                        out[b, i, j, n] = acc + bias[n]
    return out
Error:
Cannot read reduction variable in loop body
Initially I tried to parallelize only at the level of b, since parallelizing over b, i, j works at the pixel level and I do not know whether it is worth spawning that many threads. But I have not succeeded.
I tried to use a temporary array out_batch, but since it is a numpy array it gave me a lot of problems and:
Error: malloc problems
I have also tried using typed memoryviews (double[:, :, :]) instead of numpy arrays, but that gives:
Error: Memoryview slices can only be shared in parallel sections
Does anyone have an idea? Is there any way to apply nogil at the level of b, i, j (or only b) and then combine the data afterwards?
Upvotes: 0
Views: 505
Reputation: 34316
Obviously the variable acc is shared between all threads, which would lead to race conditions - Cython rightly refuses to compile this code.
The variable acc shouldn't be shared between the threads, but be private to each thread. However, to my limited knowledge, there is no way to do that in Cython yet (not sure what happened to this proposal).
A usual workaround is to allocate a large enough working array tmp and to accumulate the value for the i-th thread in tmp[i]. Often enough (but not always) an already present array can be used for this purpose, as in your case - by replacing acc with out[b, i, j, n]:
for n in range(num_filters):
    out[b, i, j, n] = 0.
    for kw in range(kernel_size0):
        for kh in range(kernel_size1):
            for m in range(num_dim):
                out[b, i, j, n] += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
    out[b, i, j, n] += bias[n]
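For completeness, here is a minimal sketch of the generic working-array idea for situations where there is no output array that can absorb the accumulator (in your case out already plays that role): one accumulator slot per thread, indexed via omp_get_thread_num(). The function name sum_parallel and the array name tmp are just illustrative, and the module must be compiled with OpenMP (e.g. -fopenmp for gcc).

# Sketch only, not your concrete case: per-thread accumulator slots
# instead of a single shared scalar.
from cython.parallel import prange, parallel
cimport openmp
import numpy as np

def sum_parallel(double[:] x):
    cdef Py_ssize_t i, n = x.shape[0]
    cdef int tid
    cdef int num_threads = openmp.omp_get_max_threads()
    # one slot per thread, allocated before the parallel section
    cdef double[:] tmp = np.zeros(num_threads, dtype=np.float64)
    with nogil, parallel():
        # tid is assigned inside the parallel block, so it is thread-private
        tid = openmp.omp_get_thread_num()
        for i in prange(n):
            # in-place update of a buffer element is not treated as a reduction
            tmp[tid] += x[i]
    # combine the per-thread partial sums with the GIL held again
    return np.asarray(tmp).sum()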
Upvotes: 1