Lack of speedup and erroneous results with OpenMP and Cython

Question

I'm trying to speed up a simple piece of code written in Cython with OpenMP. It's a double loop that, for each position in the input array, adds a quantity at each reference point. Here's the main part of the code:

cimport cython  
import numpy as np
cimport numpy as np
cimport openmp
DTYPE = np.double
ctypedef np.double_t DTYPE_t

cdef extern from "math.h" nogil :
  DTYPE_t sqrt(DTYPE_t)


@cython.cdivision(True)
@cython.boundscheck(False)
def summation(np.ndarray[DTYPE_t,ndim=2] pos, np.ndarray[DTYPE_t,ndim=1] weights, 
              np.ndarray[DTYPE_t, ndim=2] points, int num_threads = 0):

    from cython.parallel cimport prange, parallel, threadid

    if num_threads <= 0 : 
        num_threads = openmp.omp_get_num_procs()

    if num_threads > openmp.omp_get_num_procs() : 
        num_threads = openmp.omp_get_num_procs()

    openmp.omp_set_num_threads(num_threads)

    cdef unsigned int nips = len(points)
    cdef np.ndarray[DTYPE_t, ndim=1] sum_array = np.zeros(nips, dtype = np.float64)
    cdef np.ndarray[DTYPE_t, ndim=2] sum_array3d = np.zeros((nips,3), dtype = np.float64)

    cdef unsigned int n = len(weights)

    cdef unsigned int pi, i, id
    cdef double dx, dy, dz, dr, weight_i, xi,yi,zi

    print 'num_threads = ', openmp.omp_get_num_threads()

    for i in prange(n,nogil=True,schedule='static'):
        weight_i = weights[i]
        xi = pos[i,0]
        yi = pos[i,1]
        zi = pos[i,2]
        for pi in range(nips) :
            dx = points[pi,0] - xi
            dy = points[pi,1] - yi
            dz = points[pi,2] - zi
            dr = 1.0/sqrt(dx*dx + dy*dy + dz*dz)
            sum_array[pi] += weight_i * dr
            sum_array3d[pi,0] += weight_i * dx
            sum_array3d[pi,1] += weight_i * dy
            sum_array3d[pi,2] += weight_i * dz


    return sum_array, sum_array3d

I've put it into a gist with the associated test and setup files: https://gist.github.com/rokroskar/6ed1bfc1a5f8f9c183a6

Two issues arise:

First, in the current configuration I'm not able to get any speedup. The code runs on multiple cores, but the timings show no advantage.

Second, the results differ depending on the number of cores, which indicates a race condition. Aren't the in-place sums supposed to end up being reductions? Or is something funny happening because it's a nested for loop? I assumed everything inside the prange is executed in each thread individually.
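For reference, the quantity the double loop is meant to produce can be written in plain NumPy, which gives a serial result to compare the multi-threaded output against. This is only a sketch: the function name summation_reference is hypothetical (it is not part of the gist), and it builds an (n, nips, 3) temporary array, so it is only practical for small test inputs.

```python
import numpy as np

def summation_reference(pos, weights, points):
    """Serial NumPy version of the double loop (hypothetical helper for testing)."""
    # diff[i, p, :] = points[p] - pos[i], matching dx, dy, dz in the Cython loop
    diff = points[None, :, :] - pos[:, None, :]
    # 1/r for every (position, reference point) pair, shape (n, nips)
    inv_r = 1.0 / np.sqrt((diff ** 2).sum(axis=2))
    # Sum over positions i, leaving one value (or one 3-vector) per reference point
    sum_array = (weights[:, None] * inv_r).sum(axis=0)
    sum_array3d = (weights[:, None, None] * diff).sum(axis=0)
    return sum_array, sum_array3d
```

Comparing the Cython output against this with np.allclose for several values of num_threads should show whether the discrepancy grows with the thread count, which is what a race on the shared output arrays would look like.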

Both of these go away if I reverse the order of the loops. However, the outer loop as currently structured is where all the data reading gets done, so if I reverse them the arrays are traversed num_threads times, which is wasteful. I've also tried putting the whole nested loop inside a with parallel(): block and using thread-local buffers explicitly, but wasn't able to get it to work.
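To make the thread-local buffer idea concrete, here is a minimal sketch of the pattern in plain Python/NumPy rather than Cython. It runs serially; each "chunk" stands in for one thread, owns a private buffer, and the buffers are summed once at the end. The name summation_chunked and the num_chunks parameter are hypothetical and only illustrate the structure, not a working OpenMP version.

```python
import numpy as np

def summation_chunked(pos, weights, points, num_chunks=4):
    """Accumulate into one private buffer per chunk, then reduce (illustrative only)."""
    nips = len(points)
    # One private buffer per chunk; in OpenMP terms, one per thread
    local_sum = np.zeros((num_chunks, nips))
    local_sum3d = np.zeros((num_chunks, nips, 3))
    # Split the outer index range the way a static schedule would
    chunks = np.array_split(np.arange(len(weights)), num_chunks)
    for c, idx in enumerate(chunks):
        for i in idx:
            d = points - pos[i]                      # shape (nips, 3)
            inv_r = 1.0 / np.sqrt((d * d).sum(axis=1))
            local_sum[c] += weights[i] * inv_r       # touches only buffer c
            local_sum3d[c] += weights[i] * d
    # Final reduction over the private buffers
    return local_sum.sum(axis=0), local_sum3d.sum(axis=0)
```

Because no two chunks ever write to the same buffer, the result does not depend on num_chunks beyond floating-point rounding, and pos and weights are still read only once.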

Clearly I'm missing something basic about how OpenMP is supposed to work (though maybe this is particular to Cython?), so I would be grateful for tips!
