Reputation: 2422
Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?
Upvotes: 0
Views: 850
Reputation: 136208
On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:
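#include <benchmark/benchmark.h>

// Ordinary static storage: one counter for the whole process.
struct A { static unsigned n; };
unsigned A::n = 0;

// Thread-local storage: one counter per thread.
struct B { static thread_local unsigned n; };
thread_local unsigned B::n = 0;

// Increment T::n once per iteration; DoNotOptimize keeps the result from being optimized away.
template<class T>
void bm(benchmark::State& state) {
    for(auto _ : state)
        benchmark::DoNotOptimize(++T::n);
}

BENCHMARK_TEMPLATE(bm, A);
BENCHMARK_TEMPLATE(bm, B);
BENCHMARK_MAIN();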
Results:
Run on (16 X 5000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.59, 0.49, 0.38
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bm<A>            1.09 ns         1.09 ns    642390002
bm<B>            1.09 ns         1.09 ns    633963210
On x86_64, thread_local variables are accessed relative to the fs register (gs on 32-bit x86). Instructions that use this addressing mode are often 2 bytes longer, so in theory they can take more time.
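To make the addressing difference concrete, here is a minimal sketch of the two kinds of access. The names plain_counter, tls_counter, bump_plain, bump_tls and self_check are made up for illustration, and the assembly in the comments is only what GCC typically emits for a variable defined in the main executable on x86_64 Linux; the exact output depends on compiler version and flags.

```cpp
#include <cassert>

unsigned plain_counter = 0;             // one instance for the whole process
thread_local unsigned tls_counter = 0;  // one instance per thread

// Typically compiles to a single RIP-relative instruction, roughly:
//   addl $1, plain_counter(%rip)
void bump_plain() { ++plain_counter; }

// Typically compiles to a single fs-relative instruction, roughly:
//   addl $1, %fs:tls_counter@tpoff
// The fs segment prefix and the different address encoding are where the
// extra instruction bytes come from.
void bump_tls() { ++tls_counter; }

// Hypothetical self-check: on the calling thread both counters behave
// like ordinary variables.
void self_check() {
    unsigned before_plain = plain_counter;
    unsigned before_tls = tls_counter;
    bump_plain();
    bump_tls();
    assert(plain_counter == before_plain + 1);
    assert(tls_counter == before_tls + 1);
}
```

Compiling this with g++ -O2 -S and reading the generated assembly shows what your own toolchain emits for each function.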
On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.
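If you want to check another platform and Google Benchmark is not available there, a rough dependency-free timing loop can give a first impression. This is only a sketch: PlainCounter, TlsCounter and ns_per_increment are hypothetical names, the empty asm statement is a GCC/clang-specific compiler barrier, and a hand-rolled loop like this is much noisier than a real benchmark harness.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

struct PlainCounter { static unsigned n; };
unsigned PlainCounter::n = 0;

struct TlsCounter { static thread_local unsigned n; };
thread_local unsigned TlsCounter::n = 0;

// Returns the average wall-clock time in nanoseconds for one ++T::n.
template <class T>
double ns_per_increment(std::uint64_t iterations) {
    assert(iterations > 0);  // avoid dividing by zero below
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        ++T::n;
        // Compiler barrier (GCC/clang extension): forces the increment to be
        // written to memory on every iteration instead of being folded into
        // a single addition after the loop.
        asm volatile("" ::: "memory");
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling ns_per_increment<PlainCounter>(100000000) and ns_per_increment<TlsCounter>(100000000) from the same thread and comparing the two numbers gives a rough per-increment cost for each storage class. Run it several times and build with optimizations enabled before drawing conclusions.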
Upvotes: 3