Rok
Rok

Reputation: 633

lack of speedup and erroneous results with OpenMP and Cython

I'm trying to speed up a simple piece of code written in Cython with OpenMP. It's a double loop that for each position in the input array adds a quantity at each reference point. Here's the main part of the code:

cimport cython  
import numpy as np
cimport numpy as np
cimport openmp
DTYPE = np.double
ctypedef np.double_t DTYPE_t

cdef extern from "math.h" nogil :
  DTYPE_t sqrt(DTYPE_t)


@cython.cdivision(True)
@cython.boundscheck(False)
def summation(np.ndarray[DTYPE_t,ndim=2] pos, np.ndarray[DTYPE_t,ndim=1] weights, 
              np.ndarray[DTYPE_t, ndim=2] points, int num_threads = 0):

    from cython.parallel cimport prange, parallel, threadid

    if num_threads <= 0 : 
        num_threads = openmp.omp_get_num_procs()

    if num_threads > openmp.omp_get_num_procs() : 
        num_threads = openmp.omp_get_num_procs()

    openmp.omp_set_num_threads(num_threads)

    cdef unsigned int nips = len(points)
    cdef np.ndarray[DTYPE_t, ndim=1] sum_array = np.zeros(nips, dtype = np.float64)
    cdef np.ndarray[DTYPE_t, ndim=2] sum_array3d = np.zeros((nips,3), dtype = np.float64)

    cdef unsigned int n = len(weights)

    cdef unsigned int pi, i, id
    cdef double dx, dy, dz, dr, weight_i, xi,yi,zi

    print 'num_threads = ', openmp.omp_get_num_threads()

    for i in prange(n,nogil=True,schedule='static'):
        weight_i = weights[i]
        xi = pos[i,0]
        yi = pos[i,1]
        zi = pos[i,2]
        for pi in range(nips) :
            dx = points[pi,0] - xi
            dy = points[pi,1] - yi
            dz = points[pi,2] - zi
            dr = 1.0/sqrt(dx*dx + dy*dy + dz*dz)
            sum_array[pi] += weight_i * dr
            sum_array3d[pi,0] += weight_i * dx
            sum_array3d[pi,1] += weight_i * dy
            sum_array3d[pi,2] += weight_i * dz


    return sum_array, sum_array3d

I've put it into a gist with the associated test and setup files (https://gist.github.com/rokroskar/6ed1bfc1a5f8f9c183a6)

Two issues arise:

First, in the current configuration I'm not able to get any speedup. The code runs on multiple cores, but the timings show no advantage.

Second, the results are different depending on the number of cores indicating that there's a race condition. Aren't the in-place sums supposed to end up being reductions? Or is something funny happening since it's a nested for loop? I figured everything inside the prange is executed in each thread individually.

Both of these go away if I reverse the order of the loops -- but because my outer loop the way it's structured now is where all the data reading gets done, if I reverse them the arrays are traversed num_thread times which is wasteful. I've also tried putting the whole nested loop inside a with parallel(): block and use thread-local buffers explicitly but wasn't able to get it to work.

Clearly I'm missing something basic about how OpenMP is supposed to work (though maybe this is particular to Cython?) so I would be grateful for tips!

Upvotes: 6

Views: 1138

Answers (1)

Alex Rubinsteyn
Alex Rubinsteyn

Reputation: 420

Have you tried switching the two loops so you don't have multiple threads reading and writing from the same locations? I'm pretty sure that these loops are not automatically promoted into an OpenMP reduction and increments such as "sum_array3d[pi,0] += weight_i * dx" are not atomic.

Also, since the computation is relatively simple, Cython might be overkill and you might get away with using either Parakeet or Numba instead.

Parakeet will, by default, use OpenMP to execute comprehensions in parallel. You would have to rewrite your code to look something like:

@parakeet.jit
def summation(pos, weights, points):
  n_points = len(points)
  n_weights = len(weights)
  sum_array3d = np.zeros((n_points,3))
  def compute(i):
    pxi = points[i, 0]
    pyi = points[i, 1]
    pzi = points[i, 2]
    total = 0.0
    for j in xrange(n_weights):
      weight_j = weights[j]
      xj = pos[j,0]
      yj = pos[j,1]
      zj = pos[j,2]
      dx = pxi - pos[j, 0]
      dy = pyi - pos[j, 1]
      dz = pzi - pos[j, 2]
      dr = 1.0/np.sqrt(dx*dx + dy*dy + dz*dz)
      total += weight_j * dr
      sum_array3d[i,0] += weight_j * dx
      sum_array3d[i,1] += weight_j * dy
      sum_array3d[i,2] += weight_j * dz
    return total 
  sum_array = np.array([compute(i) for i in xrange(n_points)])
  return sum_array, sum_array3d

As for Numba, I'm not sure if the prange construct has made it into the free version yet.

edit: Forehead slap, sorry I missed the part of your question where you considered switching the loops.

Upvotes: 2

Related Questions