Reputation: 934
I'm trying to apply parallelism to the following algorithm. This should be easily parallelizable, as the calculations are independent for the first three dimensions (b, i, j).
def nb_forward(np.ndarray[FLOAT64, ndim=4] inputv, np.ndarray[FLOAT64, ndim=4] kernels, np.ndarray[FLOAT64, ndim=1] bias, tuple stride):
    cdef unsigned int kernel_size0 = kernels.shape[1], kernel_size1 = kernels.shape[2], \
        stride0 = stride[0], stride1 = stride[1], \
        num_dim = kernels.shape[0], \
        num_filters = kernels.shape[3], \
        batch_size = inputv.shape[0]
    cdef unsigned int out_size0 = (inputv.shape[1] - kernel_size0) / stride0 + 1, \
        out_size1 = (inputv.shape[2] - kernel_size1) / stride1 + 1
    cdef double[:, :, :, :] out = np.empty(shape=[batch_size, out_size0, out_size1, num_filters], dtype=np.float64)

    cdef unsigned int b, i, j, m, kw, kh, n
    cdef unsigned int iin, jin
    cdef double acc

    with nogil, parallel():
        for b in prange(batch_size):
            for i in range(out_size0):
                for j in range(out_size1):
                    iin = i * stride0
                    jin = j * stride1
                    for n in range(num_filters):
                        acc = 0.
                        for kw in range(kernel_size0):
                            for kh in range(kernel_size1):
                                for m in range(num_dim):
                                    acc += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
                        out[b, i, j, n] = acc + bias[n]
    return out
Error:
Cannot read reduction variable in loop body
Initially I tried to parallelize only at the level of b, since parallelizing over b, i, j works at the pixel level and I do not know whether it is worth spawning that many threads. But I have not succeeded.
I tried to use a temporary array out_batch, but since it is a numpy array it gave me a lot of problems and:
Error: malloc problems
I have also tried using typed memoryviews (double[:, :, :]) instead of numpy arrays, but that gives:
Error: Memoryview slices can only be shared in parallel sections
Does anyone have an idea? Is there any way to apply nogil at the level of b, i, j (or only b) and then combine the data afterwards?
Upvotes: 0
Views: 505
Reputation: 34316
Obviously the variable acc is shared between all threads, which would lead to race conditions - Cython rightly refuses to compile this code.
The variable acc shouldn't be shared between the threads, but be private to each thread. However, to my limited knowledge, there is no way to do that in Cython yet (not sure what happened to this proposal).
A usual workaround is to allocate a large enough working array tmp and to accumulate the value for the i-th thread in tmp[i]. Often enough (but not always) an already present array can be used for this purpose, as in your case - by replacing acc with out[b, i, j, n]:
for n in range(num_filters):
    out[b, i, j, n] = 0.
    for kw in range(kernel_size0):
        for kh in range(kernel_size1):
            for m in range(num_dim):
                out[b, i, j, n] += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
    out[b, i, j, n] += bias[n]
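For completeness, here is a minimal sketch of the generic working-array idea for situations where there is no output array that can absorb the accumulator (in your case out already plays that role): one accumulator slot per thread, indexed via omp_get_thread_num(). The function name sum_parallel and the array name tmp are just illustrative, and the module must be compiled with OpenMP (e.g. -fopenmp for gcc).

# Sketch only, not your concrete case: per-thread accumulator slots
# instead of a single shared scalar.
from cython.parallel import prange, parallel
cimport openmp
import numpy as np

def sum_parallel(double[:] x):
    cdef Py_ssize_t i, n = x.shape[0]
    cdef int tid
    cdef int num_threads = openmp.omp_get_max_threads()
    # one slot per thread, allocated before the parallel section
    cdef double[:] tmp = np.zeros(num_threads, dtype=np.float64)
    with nogil, parallel():
        # tid is assigned inside the parallel block, so it is thread-private
        tid = openmp.omp_get_thread_num()
        for i in prange(n):
            # in-place update of a buffer element is not treated as a reduction
            tmp[tid] += x[i]
    # combine the per-thread partial sums with the GIL held again
    return np.asarray(tmp).sum()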
Upvotes: 1