Luk17

Reputation: 232

Prange slowing down Cython loop

Consider two ways of generating random numbers, one single-threaded and one multithreaded using Cython's prange with OpenMP:

from libc.stdlib cimport rand

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1

and

from libc.stdlib cimport rand
from cython.parallel cimport parallel, prange

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1

Function rnd_test is first compiled with the following setup.py

from distutils.core import setup
from Cython.Build import cythonize

setup(
  name = 'Hello world app',
  ext_modules = cythonize("cython_test.pyx"),
)

rnd_test(100_000_000) runs in 0.7s.

Then, rnd_test_par is compiled with the following setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "cython_test_openmp",
        ["cython_test_openmp.pyx"],
        extra_compile_args=["-O3", '-fopenmp'],
        extra_link_args=['-fopenmp'],
    )

]

setup(
    name='hello-parallel-world',
    ext_modules=cythonize(ext_modules),
)

rnd_test_par(100_000_000) runs in 10s!!!

Similar results are obtained using Cython within IPython:

%%cython
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1

%%timeit
rnd_test(100_000_000)

1 loop, best of 3: 1.5 s per loop

and

%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1

%%timeit
rnd_test_par(100_000_000)

1 loop, best of 3: 8.42 s per loop

What am I doing wrong? I am fairly new to Cython; this is only my second time using it. I had a good experience last time, so I decided to use it for a project with Monte Carlo simulation (hence the use of rand).

Is this expected? Having read all the documentation, I think prange should work well in an embarrassingly parallel case like this. I don't understand why it fails to speed up the loop and even makes it so much slower.


I appreciate any help you can provide. I first tried Numba, and it did speed up the calculation, but it has other problems that make me want to avoid it; I'd like Cython to work in this case.

Thanks!!!

Upvotes: 2

Views: 772

Answers (2)

ead

Reputation: 34347

I would like to note two things that might be worth your consideration:

A: If you take a look at the implementation of rand() in glibc, you will see that using rand() in a multi-threaded program leads to unspecified behavior: the produced numbers are always the same (assuming the same seed), but you cannot say which number will be used by which thread due to possible race conditions. There is only one common state, shared between all threads, and it needs to be protected by a lock, otherwise even worse things could happen:

long int
__random ()
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}

From this code a possible workaround becomes clear, if we are not allowed to use C++11: give every thread its own seed and use the rand_r() function instead.

This lock is the reason you cannot see any speed-up with the original version.

B: Why don't you see more speed-up with your C++11 solution? You produce 5 GB of data and write it to memory, so it is a pretty memory-bound task. With one thread working, the memory bandwidth is enough to transport the created data, and the bottleneck is the calculation of the next random number. With two threads there is twice as much data but no more memory bandwidth. So there will be a number of threads at which the memory bandwidth becomes the bottleneck, and you will not be able to achieve any speed-up by adding more threads/cores.

So is there no gain in parallelizing the random number generation? The problem is not the random number generation itself, but the amount of data written to memory: if each created random number is consumed by the same thread without being stored in RAM, parallelizing is a much better proposition than producing the numbers in a single thread and distributing them:

  1. You don't have to write these numbers to RAM.
  2. You don't have to read these numbers from RAM.
  3. You calculate them faster than with a single thread.
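The consume-without-storing pattern can be illustrated with a toy Monte Carlo estimate of pi: each worker uses every random pair immediately to update a single counter, so no large array is ever written (a sketch, not the answer's code; the worker seeds are illustrative):

```python
import random

def mc_pi_worker(seed, n):
    """Consume each random pair immediately; only one counter is kept."""
    rng = random.Random(seed)   # private, rand_r-style per-worker state
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        hits += (x * x + y * y) <= 1.0   # point inside the quarter circle?
    return hits

# Four workers (run sequentially here); in OpenMP each would be a thread.
n_per_worker = 100_000
hits = sum(mc_pi_worker(seed, n_per_worker) for seed in range(4))
pi_est = 4.0 * hits / (4 * n_per_worker)   # close to 3.14
```

Because each worker returns only one integer, the memory traffic that dominated the original benchmark disappears entirely.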

Upvotes: 1

Luk17

Reputation: 232

With DavidW's useful feedback and links, I have a multithreaded solution for random number generation. However, the time savings over a single-threaded (vectorized) NumPy solution are not that large: the NumPy approach generates 100 million numbers (5 GB in memory) in 1.2 s versus 0.7 s for the multithreaded approach. Given the increased complexity (using C++ libraries, for example), I wonder if it's worth it. Maybe I will leave the random number generation single-threaded and work on parallelizing the calculations that follow this step.

The exercise was, however, very useful for understanding the problems of random number generators. Ultimately, I'd like to have a framework that could work in a distributed environment, and I can see now that the challenge there would be even larger with regard to the random number generator, since generators essentially have a state that cannot be ignored.
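For reference, the single-threaded NumPy baseline mentioned above is presumably something like the following (the exact call is an assumption, shown here with the modern Generator API; the post does not include it):

```python
import numpy as np

size = 1_000_000  # the post uses 100_000_000; smaller here for illustration
rng = np.random.default_rng(0)
narr = rng.random(size)   # uniform doubles in [0, 1), generated in C

assert narr.shape == (size,) and narr.dtype == np.float64
assert 0.0 <= narr.min() and narr.max() < 1.0
```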

%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11
import cython
cimport numpy as np
import numpy as np
from cython.parallel cimport parallel, prange, threadid
cimport openmp

cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass mt19937:
        mt19937() # we need to define this constructor to stack allocate classes in Cython
        mt19937(unsigned int seed) # not worrying about matching the exact int type for seed

    cdef cppclass uniform_real_distribution[T]:
        uniform_real_distribution()
        uniform_real_distribution(T a, T b)
        T operator()(mt19937 gen) # ignore the possibility of using other classes for "gen"

@cython.boundscheck(False)
@cython.wraparound(False)        
def test_rnd_par(long size):
    cdef:
        mt19937 gen
        uniform_real_distribution[double] dist = uniform_real_distribution[double](0.0,1.0)
        narr = np.empty(size, dtype=np.dtype("double"))
        double [:] narr_view = narr
        long i

    with nogil, parallel():
        gen = mt19937(openmp.omp_get_thread_num())
        for i in prange(size, schedule='static'):
            narr_view[i] = dist(gen)
    return narr
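For the distributed setting mentioned above, NumPy's SeedSequence can spawn statistically independent child streams, one per worker, which sidesteps the shared-state problem without any coordination between workers (a sketch of the idea, not part of the code above):

```python
import numpy as np

# A single root seed deterministically spawns independent child seeds,
# so each (local or remote) worker gets its own non-colliding stream.
root = np.random.SeedSequence(12345)
children = root.spawn(4)                        # one seed per worker
rngs = [np.random.default_rng(s) for s in children]

samples = [rng.random(3) for rng in rngs]       # each worker draws privately
assert not np.allclose(samples[0], samples[1])  # streams differ
```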

Upvotes: 1
