Olaf Schumann

Reputation: 281

Performance degradation due to loading a shared library with thread local storage

I am writing a Python wrapper around a large Fortran program, exposed as a Python module via pybind11. The Fortran program is a large simulation tool that uses OpenMP for multithreading. My first step was to reproduce the behavior of the Fortran executable from a Python function, which yielded (as expected) exactly the same results and the same performance. But when I started to add more functions, I observed a large performance degradation (about 50% to 100% longer runtimes).
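For context, the binding layer itself is minimal. A rough sketch of what the wrapper looks like (module and function names here are placeholders, not my real code):

#include <pybind11/pybind11.h>

// the Fortran entry point, exported with bind(C)
extern "C" void modulemain();

PYBIND11_MODULE(fcode_py, m) {
    // expose the Fortran main program as a plain Python function
    m.def("modulemain", []() { modulemain(); });
}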

Tracking the cause in pybind11

I tracked it down to a call of the pybind11 macro PYBIND11_NUMPY_DTYPE, which internally loads the numpy extension module numpy.core._multiarray_umath. I could reproduce the performance degradation with the following code:

import ctypes
import time

# This is the Fortran code, compiled to a shared library, with a
# subroutine modulemain that resembles the main program.
fcode = ctypes.CDLL("./libfcode.so")

# Merely loading the library results in worse performance of the Fortran code.
import numpy.core._multiarray_umath

t = time.time()
fcode.modulemain()
print("runtime: ", time.time()-t)

Tracking the cause in numpy

After finding that the cause of my bad performance is merely the loading of the numpy.core._multiarray_umath library, I dug into it further. Ultimately I tracked it down to two lines in that library, where two variables with thread-local storage are defined:

// from numpy 1.21.5, numpy/core/src/multiarray/multiarraymodule.c:4011
static NPY_TLS int sigint_buf_init = 0;
static NPY_TLS NPY_SIGJMP_BUF _NPY_SIGINT_BUF;

where NPY_TLS is defined as

#define NPY_TLS __thread

So loading a shared object that defines a __thread TLS variable is the root cause of my performance degradation. This leads me to two questions:

  1. Why?
  2. Is there any way to prevent it? Not using PYBIND11_NUMPY_DTYPE is not an option, as loading numpy after my module triggers the problem as well! (See the sketch after this list.)
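One direction to frame question 2, assuming one controls the library's source (which is not the case for a prebuilt numpy): GCC allows pinning the TLS model per variable, so a variant of tl.c could force a non-default model. Whether that avoids the slowdown is exactly what I am asking:

/* tl.c variant (illustrative only): request the initial-exec TLS model
   instead of the default global-dynamic one (GCC-specific attribute) */
__thread int badVariable __attribute__((tls_model("initial-exec"))) = 3;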

Minimal working example

My problem arose in a large and heavy Fortran code that I wanted to export to Python via pybind11. But in the end it boils down to using OpenMP thread-local storage and then loading, in the same Python interpreter, a library that exports a variable with __thread thread-local storage. I could create a minimal working example that reproduces the behavior.

The worker program work.f90

module data
  integer, parameter :: N = 10000
  real :: X(1:N)
  !$omp threadprivate(X)
end module

subroutine work() bind(C, name="worker")
  use data, only: X,N

  !$omp parallel
  X(1) = 0.131
  do i=2,N
    do j=1,i-1
      X(i) = X(i) + 0.431*sin(X(i-1))
    end do
  end do
  !$omp end parallel
end subroutine

The bad library tl.c

__thread int badVariable = 3;

A Python script that shows the effect, run.py

import ctypes
import time

work = ctypes.CDLL("./libwork.so")

# first worker run, without libtl.so loaded. Good performance!
t = time.time()
work.worker()
print("TIME: ", time.time()-t)

# load the bad library
bad = ctypes.CDLL("./libtl.so")

# second worker with degraded performance
t = time.time()
work.worker()
print("TIME: ", time.time()-t)

The Makefile

FLAGS = -fPIC  -shared

all: libwork.so libtl.so

libwork.so: work.f90
        gfortran-11 $(FLAGS) work.f90 -fopenmp -o $@

libtl.so: tl.c
        gcc-11 $(FLAGS) tl.c -o $@

The worker is so simple that enabling optimization hides the effect. I guess a call to access the thread-local storage area could easily be optimized out here. But in a real program, the effect is there even with optimization enabled.

Setup

I have the problem on an Ubuntu 22.04 LTS computer with an x86 CPU (Xeon 8280M). gcc is Ubuntu 11.3.0-1ubuntu1~22.04 (I tried others down to 7.5.0 with the same effect). Python is version 3.10.6. The problem is not Fortran-specific; I can easily write a worker in plain C with the same effect (a sketch follows below). I also tried this on a Raspberry Pi with the same effect (ARM, GCC 8.3.0, Python 2.7.16)!
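For completeness, a sketch of what such a plain-C-style worker might look like (illustrative only, not the exact code I used):

// work.cpp -- same access pattern as work.f90, threadprivate replaced
// by thread_local; build: g++ -fPIC -shared -fopenmp work.cpp -o libwork.so
#include <cmath>

constexpr int N = 10000;
thread_local float X[N];   // analogous to !$omp threadprivate(X)

extern "C" void worker()
{
    #pragma omp parallel
    {
        X[0] = 0.131f;
        for (int i = 1; i < N; ++i)
            for (int j = 0; j < i; ++j)
                X[i] += 0.431f * std::sin(X[i - 1]);
    }
}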

Upvotes: 2

Views: 165

Answers (0)
