Parallelism using prange in Cython fails

Question

I tried to use prange to speed up the code, but the run time is almost the same as for the initial version. The initial version is as follows:

%%cython -a --cplus
import cython

cdef extern from "<complex>" namespace "std" nogil:
    double complex exp(double complex)

@cython.boundscheck(False)
@cython.wraparound(False)    
def cphaseshift(double px,double py,double[:,:] kX,double[:,:] kY,double complex[:,:] f):
    cdef double complex I = 1j
    cdef double pi = 3.141592653589793
    cdef int i
    cdef int j
    for i in range(kX.shape[0]):
        for j in range(kY.shape[1]):
            f[i,j] = f[i,j]*exp(-2*pi*I*(px*kX[i,j]+py*kY[i,j]))

Speed of the initial version:

import numpy as np

px = 1
py = 1
kx = np.linspace(1,256,256)
kX,kY = np.meshgrid(kx,kx)
f0 = np.ones_like(kX,dtype='complex128')

%timeit cphaseshift(px,py,kX,kY,f0)

2.07 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The parallel version:

%%cython -a --cplus
import cython
from cython.parallel import prange

cdef extern from "<complex>" namespace "std" nogil:
    double complex exp(double complex)

@cython.boundscheck(False)
@cython.wraparound(False)    
def cphaseshift_para(double px,double py,double[:,:] kX,double[:,:] kY,double complex[:,:] f):
    cdef double complex I = 1j
    cdef double pi = 3.141592653589793
    cdef int i
    cdef int j
    for i in prange(kX.shape[0],nogil=True,num_threads=6):
        for j in range(kY.shape[1]):
            f[i,j] = f[i,j]*exp(-2*pi*I*(px*kX[i,j]+py*kY[i,j]))

Speed of this version:

px = 1
py = 1
kx = np.linspace(1,256,256)
kX,kY = np.meshgrid(kx,kx)
f0 = np.ones_like(kX,dtype='complex128')

%timeit cphaseshift_para(px,py,kX,kY,f0)

2.12 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I noticed that the annotation for the line for i in prange(kX.shape[0],nogil=True,num_threads=6): hints at Python interaction. How do I use prange correctly to speed up the code? Thanks!

ead · Accepted Answer

prange uses OpenMP to parallelize the calculation. However, OpenMP directives are not activated by default: the compiler and (on some platforms) the linker must be invoked with special flags to enable them.

For IPython this can be done via

%%cython -c=-fopenmp --link-args=-fopenmp
....

on Linux with gcc.

On Windows with MSVC the arguments would be:

%%cython -c=/openmp 
....

(no additional link-argument is needed in this case).

If a setup file is used, then -fopenmp (gcc) or /openmp (MSVC) needs to be passed to the Extension via extra_compile_args and, for gcc, also via extra_link_args.

I.e. on Linux+gcc:

...
ext_modules = [
    Extension(
        "myextension",
        ["myextension.pyx"],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]
...

On Windows+MSVC only extra_compile_args are needed:

...
ext_modules = [
    Extension(
        "myextension",
        ["myextension.pyx"],
        extra_compile_args=['/openmp'],
    )
]
...
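
If the same setup file has to build on both platforms, the flags can be picked at build time. The following is only a minimal sketch, assuming MSVC is the compiler on Windows and gcc everywhere else; openmp_flags is a hypothetical helper name (not part of setuptools or Cython), and other compilers may need different flags:

```python
import sys

def openmp_flags(platform=sys.platform):
    """Return (extra_compile_args, extra_link_args) for enabling OpenMP.

    Assumption: MSVC is the compiler on Windows, gcc everywhere else.
    """
    if platform.startswith("win"):
        # MSVC: only a compile flag, no additional link argument
        return ["/openmp"], []
    # gcc: the flag has to reach both the compiler and the linker
    return ["-fopenmp"], ["-fopenmp"]

compile_args, link_args = openmp_flags()
print(compile_args, link_args)
```

The two lists can then be passed as extra_compile_args and extra_link_args to the Extension shown above.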

Compiling with OpenMP enabled leads to a speed-up of about a factor of 3 on my machine:

%timeit cphaseshift(px,py,kX,kY,f0)
# 2.87 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit cphaseshift_para(px,py,kX,kY,f0)
# 912 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
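
When comparing a serial and a parallel version, it can also be worth checking that both produce the same values. The following is a slow pure-Python reference for the same formula, f[i,j]*exp(-2*pi*I*(px*kX[i,j]+py*kY[i,j])), meant only as a sketch for small inputs; phaseshift_ref is a hypothetical name, and unlike the Cython functions it returns a new nested list instead of modifying f in place:

```python
import cmath
import math

def phaseshift_ref(px, py, kX, kY, f):
    """Return f[i][j] * exp(-2*pi*1j*(px*kX[i][j] + py*kY[i][j])) as a new nested list."""
    return [
        [
            f[i][j] * cmath.exp(-2j * math.pi * (px * kX[i][j] + py * kY[i][j]))
            for j in range(len(f[i]))
        ]
        for i in range(len(f))
    ]

# Tiny hand-checkable input: the phases are 0, 1/4, 1/2 and 3/4 of a full turn
kX = [[0.0, 0.25], [0.0, 0.25]]
kY = [[0.0, 0.0], [0.5, 0.5]]
f = [[1 + 0j, 1 + 0j], [1 + 0j, 1 + 0j]]
out = phaseshift_ref(1.0, 1.0, kX, kY, f)
print(out)
```

Up to rounding, the four results are 1, -1j, -1 and 1j. The outputs of cphaseshift and cphaseshift_para on the same small input can be compared against such a reference element by element.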

The Python interaction in the

for i in prange(kX.shape[0],nogil=True,num_threads=6):

line cannot be avoided: the GIL has to be released and then acquired again, and both actions require interaction with the Python interpreter.
