Reputation: 437
I've started messing around with parallel programming and cython/openmp, and I have a simple program that sums over an array using prange:
import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total
    cdef int i
    for i in prange(n, nogil=True):
        total += vec[i]
    return total
It seems to work OK with a setup.py file. However, I was wondering if it is possible to adjust this function and have a little more control over what the processors are doing.
Let's say I have 4 processors: I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum. From the cython documentation, I wasn't able to gather whether something like this is possible or not (the documentation is a little sparse).
I'd appreciate it if someone could explain if/how something like this is done using cython/openmp, or maybe help locate some relevant examples (it's been surprisingly hard to find simple ones online).
Upvotes: 4
Views: 107
Reputation: 22670
I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum.
That's exactly what's happening here already. Cython infers from your in-place operation (total += ...) that you want to do a reduction. OpenMP will implement a parallel loop with private (zero-initialized) copies of the total variable and add them all to total at the end of the loop.
In the generated C, this looks like this:
#pragma omp parallel
{
    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) reduction(+:__pyx_v_total)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        {
            __pyx_v_i = (int)(0 + 1 * __pyx_t_2);
            __pyx_t_4 = __pyx_v_i;
            __pyx_v_total = (__pyx_v_total + (*((double *) ( /* dim=0 */ (__pyx_v_vec.data + __pyx_t_4 * __pyx_v_vec.strides[0]) ))));
        }
    }
}
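To make the reduction pattern concrete, here is a rough plain-Python sketch of the same data flow written out by hand: split the vector into chunks, give each worker its own zero-initialized accumulator, and combine the partial sums at the end. The names partial_sum and chunked_sum are made up for this illustration, and the contiguous, equal-sized chunking is an assumption; the OpenMP runtime decides how iterations are actually distributed. Python threads are used only to show the structure, not to gain speed.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(values, start, stop):
    """Sum values[start:stop] into a private accumulator that starts at zero."""
    local_total = 0.0  # plays the role of one worker's private copy of `total`
    for i in range(start, stop):
        local_total += values[i]
    return local_total


def chunked_sum(values, num_workers=4):
    """Split values into contiguous chunks, sum each one, then combine."""
    n = len(values)
    # The first (n % num_workers) chunks get one extra element.
    base, extra = divmod(n, num_workers)
    bounds = []
    start = 0
    for w in range(num_workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop

    # Each worker handles one chunk (illustrative only, no real speed-up here).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda b: partial_sum(values, *b), bounds))

    # Final combine step: the shared total starts from a defined value
    # and every partial result is added to it.
    total = 0.0
    for p in partials:
        total += p
    return total
```

Calling chunked_sum(vec, num_workers=4) mirrors the "4 processors, 4 parts" scheme from the question; with prange you get this behaviour without writing the chunking yourself.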
You just need to enable OpenMP as described here.
The one thing you should change in your code is to initialize total = 0; otherwise it is just an uninitialized C variable, which may contain garbage.
Upvotes: 1