Adria Ciurana

Reputation: 934

How to use parallelism in cython

I'm trying to apply parallelism to the following algorithm. This should be easily parallelizable, as the calculations are independent for the first three dimensions (b, i, j).

# Imports/typedefs needed for this .pyx to compile
import numpy as np
cimport numpy as np
from cython.parallel import prange, parallel

ctypedef np.float64_t FLOAT64

def nb_forward(np.ndarray[FLOAT64, ndim=4] inputv, np.ndarray[FLOAT64, ndim=4] kernels, np.ndarray[FLOAT64, ndim=1] bias, tuple stride):
    cdef unsigned int kernel_size0 = kernels.shape[1], kernel_size1 = kernels.shape[2], \
    stride0 = stride[0], stride1 = stride[1], \
    num_dim = kernels.shape[0], \
    num_filters = kernels.shape[3], \
    batch_size = inputv.shape[0]

    cdef unsigned int out_size0 = (inputv.shape[1] - kernel_size0) / stride0 + 1, \
    out_size1 = (inputv.shape[2] - kernel_size1) / stride1 + 1

    cdef double[:, :, :, :] out = np.empty(shape=[batch_size, out_size0, out_size1, num_filters], dtype=np.float64)

    cdef unsigned int b, i, j, m, kw, kh, n
    cdef unsigned int iin, jin
    cdef double acc

    with nogil, parallel():
        for b in prange(batch_size):
            for i in range(out_size0):
                for j in range(out_size1):
                    iin = i*stride0
                    jin = j*stride1

                    for n in range(num_filters):
                        acc = 0.
                        for kw in range(kernel_size0):
                            for kh in range(kernel_size1):
                                for m in range(num_dim):
                                    acc += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
                        out[b, i, j, n] = acc + bias[n]
    return out

Error:
Cannot read reduction variable in loop body

Initially I tried to parallelize only over b, since parallelizing over b, i, j works at the pixel level and I do not know whether spawning that many threads is worthwhile. But I have not succeeded.

I tried using a temporary array out_batch, but since it is a numpy array it gave me a lot of problems, such as:

Error: malloc problems

I have also tried using typed memoryviews (double[:, :, :]) instead of numpy arrays, but that gives:

Error: Memoryview slices can only be shared in parallel sections

Does anyone have an idea? Is there any way to apply nogil at the level of b, i, j (or only b) and then compact the data?

Upvotes: 0

Views: 505

Answers (1)

ead

Reputation: 34316

Obviously the variable acc is shared between all threads, which can lead to race conditions - Cython rightly refuses to compile this code.

The variable acc shouldn't be shared between the threads but should be private to each thread. However, to my limited knowledge, there is no way to declare a variable thread-private in Cython yet (not sure what happened with this proposal).
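The thread-private idea can be illustrated in plain Python (a sketch, not Cython - the names partial and work are made up here): give every thread its own slot to accumulate into, and combine the slots only after all threads have joined, so no accumulator is ever shared.

```python
import threading
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)
num_threads = 4
chunks = np.array_split(data, num_threads)

# One slot per thread: thread t only ever touches partial[t],
# so its accumulator is effectively private - no race is possible.
partial = np.zeros(num_threads)

def work(tid, chunk):
    partial[tid] = chunk.sum()

threads = [threading.Thread(target=work, args=(t, c))
           for t, c in enumerate(chunks)]
for th in threads:
    th.start()
for th in threads:
    th.join()

# Combine the private results once, after the parallel part is over.
total = partial.sum()
```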

A usual workaround is to allocate a large enough working array tmp and to accumulate the value for the i-th thread in tmp[i]. Often (but not always) an already-present array can serve this purpose - as in your case, by replacing acc with out[b, i, j, n]:

for n in range(num_filters):
    out[b, i, j, n] = 0.
    for kw in range(kernel_size0):
        for kh in range(kernel_size1):
            for m in range(num_dim):
                out[b, i, j, n] += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
    out[b, i, j, n] += bias[n]
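As a sanity check for the parallel version, here is a plain-NumPy sketch of the same channels-last convolution (the name nb_forward_ref and the use of einsum are my own; this is a slow reference implementation, not a replacement for the Cython kernel):

```python
import numpy as np

def nb_forward_ref(inputv, kernels, bias, stride):
    """Reference for the Cython kernel: inputv is (B, H, W, C),
    kernels is (C, k0, k1, num_filters), bias is (num_filters,)."""
    s0, s1 = stride
    c, k0, k1, nf = kernels.shape
    b, h, w, _ = inputv.shape
    out0 = (h - k0) // s0 + 1
    out1 = (w - k1) // s1 + 1
    out = np.empty((b, out0, out1, nf))
    for i in range(out0):
        for j in range(out1):
            # Window matching inputv[b, i*s0 + kw, j*s1 + kh, m] in the loops
            patch = inputv[:, i*s0:i*s0 + k0, j*s1:j*s1 + k1, :]
            # Contract patch (b, k0, k1, c) with kernels (c, k0, k1, nf)
            out[:, i, j, :] = np.einsum('bklc,cklf->bf', patch, kernels) + bias
    return out
```

Comparing its output against the compiled function on a small random input is a quick way to confirm the parallel rewrite still computes the same thing.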

Upvotes: 1
