Swiss Frank

Reputation: 2422

Any performance penalty to be expected with thread_local?

Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?

Upvotes: 0

Views: 850

Answers (1)

Maxim Egorushkin

Reputation: 136208

On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:

#include <benchmark/benchmark.h>

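// A: counter with ordinary static storage.
struct A { static unsigned n; };
unsigned A::n = 0;

// B: the same counter, but with one instance per thread.
struct B { static thread_local unsigned n; };
thread_local unsigned B::n = 0;

// Times a single increment of T::n; DoNotOptimize keeps the result live
// so the compiler cannot discard the increment.
template<class T>
void bm(benchmark::State& state) {
    for(auto _ : state)
        benchmark::DoNotOptimize(++T::n);
}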

BENCHMARK_TEMPLATE(bm, A);
BENCHMARK_TEMPLATE(bm, B);
BENCHMARK_MAIN();

Results:

Run on (16 X 5000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.59, 0.49, 0.38
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bm<A>            1.09 ns         1.09 ns    642390002
bm<B>            1.09 ns         1.09 ns    633963210

On x86_64, thread_local variables are accessed relative to the fs segment register. Instructions that use this addressing mode are often 2 bytes longer, so in theory they can take slightly more time.
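To see the fs-relative addressing for yourself, a minimal sketch along these lines can be compiled with something like g++ -O2 -S and the two increment functions compared in the generated assembly. The names (g_counter, t_counter, bump_global, bump_tls, count_on_new_thread) are hypothetical, and the instructions quoted in the comments are what GCC typically emits for an executable on x86_64 Linux, not a guarantee; shared libraries and other TLS models produce different code.

```cpp
#include <thread>

// Hypothetical names, used only for illustration.
unsigned g_counter = 0;               // ordinary static storage
thread_local unsigned t_counter = 0;  // one instance per thread

// Typically compiles to a RIP-relative access, roughly:
//     addl $1, g_counter(%rip)
void bump_global() { ++g_counter; }

// Typically compiles to an fs-relative access, roughly:
//     addl $1, %fs:t_counter@tpoff
// (assumption: variable defined in the executable itself, optimization on)
void bump_tls() { ++t_counter; }

// Increments the thread-local counter n times on a fresh thread and
// returns the value that thread observed in its own copy.
unsigned count_on_new_thread(unsigned n) {
    unsigned seen = 0;
    std::thread worker([&seen, n] {
        for (unsigned i = 0; i < n; ++i) bump_tls();
        seen = t_counter;  // this thread's copy, which started at 0
    });
    worker.join();
    return seen;
}
```

Calling count_on_new_thread leaves the calling thread's t_counter untouched, which is the per-thread behaviour that the fs-relative offset implements: each thread's fs base points at its own thread control block, and the variable lives at a fixed offset from it.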

On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.

Upvotes: 3
