Reputation: 281
I am writing a Python wrapper around a large Fortran program with pybind11, exposing it as a Python module. The Fortran program is a large simulation tool that uses OpenMP for multithreading. My initial work was to reproduce the behavior of the Fortran executable from a Python function. That yielded (as expected) exactly the same results and the same performance. But when I started to add more functions, I observed a large performance degradation (about 50% to 100% longer runtimes).
I could track it down to a call of the pybind11 macro PYBIND11_NUMPY_DTYPE, which internally loads the numpy library numpy.core._multiarray_umath. I could reproduce the performance degradation with the following code:
import ctypes
import time
# This is the Fortran code, compiled to a shared library; its subroutine
# "modulemain" resembles the main program.
fcode = ctypes.CDLL("./libfcode.so")
# Merely loading the following library already degrades the performance
# of the Fortran code.
import numpy.core._multiarray_umath
t = time.time()
fcode.modulemain()
print("runtime: ", time.time()-t)
After finding that the reason for my bad performance lies merely in loading the numpy.core._multiarray_umath library, I dug further into it. Ultimately I could track it down to two lines in that library, where two variables with thread-local storage are defined:
// from numpy 1.21.5, numpy/core/src/multiarray/multiarraymodule.c:4011
static NPY_TLS int sigint_buf_init = 0;
static NPY_TLS NPY_SIGJMP_BUF _NPY_SIGINT_BUF;
where NPY_TLS is defined as
#define NPY_TLS __thread
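For readers unfamiliar with it: __thread is the GCC keyword for thread-local storage, meaning every thread gets its own independent copy of the variable. A minimal stand-alone illustration of the semantics (my own sketch, tls_demo.c, not numpy code; build with gcc -fopenmp tls_demo.c):
#include <stdio.h>
#include <omp.h>

__thread int counter = 0;  /* one independent copy per thread */

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        counter++;  /* touches only this thread's copy */
        printf("thread %d sees counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;  /* every thread prints "counter = 1" */
}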
So the inclusion of a shared object with __thread TLS is the root cause of my performance degradation. This leads me to my two questions:
1. Why does loading a shared object that merely defines a __thread variable degrade the performance of my OpenMP code?
2. What can I do about it? Simply avoiding PYBIND11_NUMPY_DTYPE is no option, as loading the numpy library after my module will trigger the bug as well!
To recap: my problem concerns a large and heavy Fortran code that I wanted to export to Python via pybind11. In the end it boiled down to the combination of using OpenMP thread-local storage and then loading, in the Python interpreter, a library that exports a variable with __thread thread-local storage. I could create a minimal working example that reproduces the behavior.
The worker program work.f90
module data
  integer, parameter :: N = 10000
  real :: X(1:N)
  !$omp threadprivate(X)
end module

subroutine work() bind(C, name="worker")
  use data, only: X, N
  implicit none
  integer :: i, j
  ! Every thread runs the full loop nest on its own threadprivate copy of X.
  !$omp parallel private(i, j)
  X(1) = 0.131
  do i = 2, N
    do j = 1, i-1
      X(i) = X(i) + 0.431*sin(X(i-1))
    end do
  end do
  !$omp end parallel
end subroutine
The bad library tl.c
__thread int badVariable = 3;
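As a side note for experiments in this direction: gcc can pin the TLS access model of a variable explicitly. This variant is my own sketch, not part of the original setup, built the same way as libtl.so:
/* tl_ie.c -- like tl.c, but forcing the initial-exec TLS model, which
   avoids the dynamic __tls_get_addr path; the whole file could equally
   be compiled with -ftls-model=initial-exec. Caveat: initial-exec TLS
   in a dlopen'ed library draws from the loader's static TLS reserve,
   so dlopen can fail once that reserve is exhausted. */
__thread int badVariable __attribute__((tls_model("initial-exec"))) = 3;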
A Python script that shows the effect, run.py
import ctypes
import time
work = ctypes.CDLL("./libwork.so")
# First worker run, without libtl.so loaded. Good performance!
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
# load the bad library
bad = ctypes.CDLL("./libtl.so")
# Second worker run, now with degraded performance
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
The Makefile
FLAGS = -fPIC -shared

all: libwork.so libtl.so

libwork.so: work.f90
	gfortran-11 $(FLAGS) work.f90 -fopenmp -o $@

libtl.so: tl.c
	gcc-11 $(FLAGS) tl.c -o $@
The worker is so simple that enabling optimization hides the effect. I guess it is the call that obtains the address of the thread-local storage area, which can easily be hoisted out of the loops here. In a real program, however, the effect is there even with optimization enabled.
I have the problem on an Ubuntu 22.04 LTS machine with an x86 CPU (Xeon 8280M). gcc is Ubuntu 11.3.0-1ubuntu1~22.04 (I tried others down to 7.5.0 with the same effect), and Python is version 3.10.6. I also tried this on a Raspberry Pi with the same result (ARM, GCC 8.3.0, Python 2.7.16). The problem is not Fortran specific either: I can easily write a worker in plain C with the same effect, as in the sketch below.
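A plain-C analogue of the Fortran worker (my own sketch; the file name cwork.c, the exported symbol, and the build line are my choices, mirroring work.f90):
/* cwork.c -- plain-C analogue of work.f90; build with
   gcc-11 -fPIC -shared -fopenmp cwork.c -o libcwork.so -lm
   and call it from run.py via ctypes.CDLL("./libcwork.so").worker() */
#include <math.h>

#define N 10000
static float X[N];
#pragma omp threadprivate(X)   /* one copy of X per OpenMP thread */

void worker(void) {
    #pragma omp parallel
    {
        /* every thread runs the full loop nest on its own copy of X */
        X[0] = 0.131f;
        for (int i = 1; i < N; i++)
            for (int j = 0; j < i; j++)
                X[i] += 0.431f * sinf(X[i-1]);
    }
}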
Upvotes: 2
Views: 165