Mark

Reputation: 629

Linking cython code against libiomp5 instead of libgomp

Is it possible to link Cython code that uses OMP (e.g. things like "prange" statements) against libiomp5 instead of libgomp using gcc? I am aware of several posts, e.g. Telling GCC to *not* link libgomp so it links libiomp5 instead, and others, describing how one might achieve this. However, they do not seem to work for me. What am I doing wrong?

Specifically, say I am using a recent Anaconda distribution and have some file.pyx on which I run cython -a file.pyx to get file.c. Then, for libgomp, I would do something like

gcc -shared -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -fopenmp -o file.so -I/include_dirs file.c 

This gives me a file.so that shows

>ldd file.so
...
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007fc3725ab000)
...

For libiomp5, going by the previously mentioned posts, I was hoping that this would do the job

gcc -shared -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -o file.so -I/include_dirs file.c -L/lib_dirs -liomp5

Indeed, the file.so I get shows

>ldd *.so
...
libiomp5.so => /lib_dirs/libiomp5.so (0x00007ff92c717000)
...

However, when I call file.so from code which forces a specific number of OMP threads, only the version of file.so that has been linked against libgomp uses more than a single thread. That is, linking against libiomp5 produces no error, but the system behaves as if no OMP pragmas had been present in the first place.

PS: I have also tried adding -Wl,--as-needed to the gcc options (without really knowing what for), but that does not change the picture.
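
To see whether the parallel region is actually active at run time (rather than just which library ldd reports), a small diagnostic module can help. The following is only a sketch, not part of the original post: the name threadcheck.pyx is made up, and it is meant to be cythonized, compiled and linked in exactly the same way as file.c above.

# cython: language_level=3
# threadcheck.pyx -- hypothetical helper, built the same way as file.c
cimport openmp
from cython.parallel import prange

def team_size(int nt):
    """OMP team size seen inside a parallel region (1 if no pragmas were emitted)."""
    cdef int i
    cdef int n = 0
    for i in prange(1, nogil=True, num_threads=nt):
        n = openmp.omp_get_num_threads()   # assigned inside prange -> lastprivate
    return n

If the OMP pragmas were emitted at compile time, threadcheck.team_size(4) should return 4 with either runtime; if they were not, it returns 1 no matter which library the module is linked against.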

UPDATE: ----------------

Following the request of user vidyalatha-intel, here is an example. It is not coded for optimal style, nor does it solve any particular problem. It is just meant to reproduce the issue.

A) Some Python code (stackovfl.py) to invoke the *.so lib

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import numpy.random as rnd

import stackovfl_OMP as sc  # THE lib

N = 600
# init a couple of 1d and 2d arrays
f = rnd.random(N)
e = rnd.random(N)
v = rnd.random((N,N))+1j*rnd.random((N,N))
z = np.linspace(0,3,150) + .05*1j
numthread = 4                      # explicitly force 4 threads
s = []

for i in z: # for each z do stuff needing OMP in sc.sit
    print(np.real(i))
    s.append([np.real(i),sc.sit(i,v,e,f,numthread)])

B) The Cython code stackovfl_OMP.pyx for the lib stackovfl_OMP.so, doing some (rather senseless) stuff in three nested loops, the outermost of which uses OMP

# -*- coding: utf-8 -*-
# cython: language_level=3

cimport cython
cimport openmp
from cython.parallel import prange
import numpy as np
cimport numpy as np

@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.complex128_t sit(
    np.complex128_t z,
    np.ndarray[np.complex128_t,ndim=2] t,
    np.ndarray[np.float64_t,ndim=1]    e,
    np.ndarray[np.float64_t,ndim=1]    f,
    np.int32_t nt                      # num threads
    ):
    cdef int a,b,c,l,it
    l = len(e)
    ''' si  :   Partial storage for each thread in the pool
        siv :   Nogil memviews for numpy array r/w access '''
    cdef np.ndarray[np.complex128_t,ndim=1] si = np.zeros(nt,dtype=np.complex128)
    cdef complex[:] siv = si
    
    for a in prange(l, nogil=True, num_threads=nt):     # OMP parallelization outer loop
        it = openmp.omp_get_thread_num()                # fixed within one thread
        for b in range(l):
            for c in range(l):                          # Do 'something'
                siv[it] = siv[it] + t[c,b]*t[b,a]*t[a,c]/(2*z-e[a]+e[c])*(                    
                    (f[a]-f[b])/(z-e[a]+e[b]) + (f[c]-f[b])/(z-e[b]+e[c]))
                
    return np.sum(si)                                   # return collected pool

With A) and B) you can proceed as described in the original post and generate stackovfl_OMP.so, linking either against libgomp or libiomp5. As stated there, only the libgomp build ends up with four threads running when you call python stackovfl.py, while the libiomp5-linked version of stackovfl_OMP.so stays on a single thread. (Additionally exporting OMP_NUM_THREADS=4 into the environment does not change this.)
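
Besides ldd, one can also check at run time which OMP runtime actually gets loaded together with the extension. This is a sketch only (Linux-specific, and the file name check_runtime.py is made up); note that other imported extensions may pull in an OMP runtime of their own.

#!/usr/bin/env python3
# check_runtime.py -- hypothetical helper: list the OMP runtimes mapped into this process
import stackovfl_OMP                      # importing the extension loads its OMP runtime

with open("/proc/self/maps") as maps:     # every mapped shared object shows up here
    loaded = {line.split()[-1] for line in maps if "omp" in line}

for lib in sorted(loaded):
    print(lib)                            # expect .../libgomp.so.1 or .../libiomp5.so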

Upvotes: 0

Views: 229

Answers (1)

Mark

Reputation: 629

I posted this question to the cython-users Google group at https://groups.google.com/g/cython-users/c/2niCShTH4OE, where the answer was finally given by Cython core developer D. Woods.

In a nutshell: split the compile and the link step. Compiling with -fopenmp emits the OMP pragmas, and the resulting *.o object file can then be linked directly against libiomp5. For the example I gave, this boils down to, e.g.:

gcc -c -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -fopenmp stackovfl_OMP.c -o stackovfl_OMP.o -I/include_paths

gcc -shared -pthread -Wl,--as-needed stackovfl_OMP.o -o stackovfl_OMP.so -L/library_paths -liomp5

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/library_paths

Finally, ldd shows stackovfl_OMP.so to be linked against libiomp5, and it runs with as many threads as one likes. (Do I get any performance difference w.r.t. libgomp? Nope.)
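
For reference, a setuptools build performs the same compile/link split automatically, since extra_compile_args are only passed to the compile step and extra_link_args only to the link step. A sketch (the -L path /library_paths is a placeholder for wherever libiomp5 lives):

# setup.py -- sketch only; adjust /library_paths to the location of libiomp5
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "stackovfl_OMP",
    sources=["stackovfl_OMP.pyx"],
    include_dirs=[numpy.get_include()],
    extra_compile_args=["-fopenmp"],                   # emit the OMP pragmas when compiling
    extra_link_args=["-L/library_paths", "-liomp5"],   # link against libiomp5, not libgomp
)

setup(ext_modules=cythonize([ext]))

Built with python setup.py build_ext --inplace, the compile step gets -fopenmp while the link step gets only -liomp5, matching the two gcc invocations above.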

Thanks, D. Woods. Credit where credit is due.

Upvotes: 0
