Reputation: 1513
The code below demonstrates curiosities of multi-threaded programming, in particular the performance of a std::memory_order_relaxed increment versus a regular increment in a single thread.
What I do not understand is why fetch_add(relaxed) is twice as slow as a regular increment, even when it runs single-threaded.
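#include <atomic>
#include <benchmark/benchmark.h>

static void BM_IncrementCounterLocal(benchmark::State& state) {
    // Initialize explicitly: a default-constructed std::atomic_int
    // holds an indeterminate value before C++20.
    volatile std::atomic_int val2{0};
    while (state.KeepRunning()) {
        for (int i = 0; i < 10; ++i) {
            benchmark::DoNotOptimize(val2.fetch_add(1, std::memory_order_relaxed));
        }
    }
}
BENCHMARK(BM_IncrementCounterLocal)->ThreadRange(1, 8);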
static void BM_IncrementCounterLocalInt(benchmark::State& state) {
    volatile int val3 = 0;
    while (state.KeepRunning()) {
        for (int i = 0; i < 10; ++i) {
            benchmark::DoNotOptimize(++val3);
        }
    }
}
BENCHMARK(BM_IncrementCounterLocalInt)->ThreadRange(1, 8);
Output:
Benchmark                                Time(ns)   CPU(ns)   Iterations
----------------------------------------------------------------------
BM_IncrementCounterLocal/threads:1             59        60     11402509
BM_IncrementCounterLocal/threads:2             30        61     11284498
BM_IncrementCounterLocal/threads:4             19        62     11373100
BM_IncrementCounterLocal/threads:8             17        62     10491608
BM_IncrementCounterLocalInt/threads:1          31        31     22592452
BM_IncrementCounterLocalInt/threads:2          15        31     22170842
BM_IncrementCounterLocalInt/threads:4           8        31     22214640
BM_IncrementCounterLocalInt/threads:8           9        31     21889704
Upvotes: 7
Views: 4490
Reputation: 40645
With the volatile int, the compiler must ensure that it does not optimize away and/or reorder any reads/writes of the variable.
With the fetch_add, the CPU must take precautions to ensure that the read-modify-write operation is atomic.
These are two completely different requirements. The atomicity requirement means that the CPU has to coordinate with the other CPUs on your machine, ensuring that they don't read or write the given memory location between its own read and write. If the compiler implements the fetch_add using a compare-and-swap instruction, it will actually emit a short loop to catch the case where some other CPU modified the value in between.
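As a rough sketch of the compare-and-swap loop described above: the helper names fetch_add_via_cas and check_fetch_add_via_cas are made up for illustration, and this is not what any particular compiler emits (on x86, compilers typically use a single lock xadd instruction rather than a loop).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: emulates fetch_add with a compare-and-swap retry loop.
// Returns the value the counter held before the addition, like fetch_add.
int fetch_add_via_cas(std::atomic<int>& target, int delta) {
    int expected = target.load(std::memory_order_relaxed);
    // If another thread changed the value in between (or the weak CAS fails
    // spuriously), compare_exchange_weak stores the current value into
    // `expected` and returns false, so the loop retries with fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}

// Single-threaded sanity check of the helper.
void check_fetch_add_via_cas() {
    std::atomic<int> counter{5};
    int previous = fetch_add_via_cas(counter, 3);
    assert(previous == 5);
    assert(counter.load() == 8);
}
```

Even in the uncontended single-threaded case the loop body runs at least once, and the atomic instruction itself costs more than a plain load, add, and store.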
For the volatile int, no such communication is necessary. On the contrary, volatile requires that the compiler does not invent any reads: volatile was designed for single-threaded communication with hardware registers, where the mere act of reading the value may have side effects.
Upvotes: 2
Reputation: 29017
The plain int version is not using atomics. (The fact that it is using volatile is a red herring: volatile has essentially no meaning in multi-threaded code.)
The atomic version is using atomics (!). The fact that only one thread is actually going to access the variable is invisible to the CPU, and I'm not surprised the compiler hasn't spotted it either. (There's no point wasting developer effort figuring out whether it is safe to turn std::atomic_int into int, when it almost never will be. Nobody will write atomic_int if they don't need to access it from multiple threads.)
As such, the atomic version will go to the trouble of making sure the increment actually is atomic, and frankly, I'm surprised that it is only 2x slower. I would have expected more like 10x.
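To make concrete what that extra cost buys, here is a minimal sketch (the function name count_with_relaxed_atomics and its parameters are hypothetical, not taken from the question). Several threads increment one shared counter with relaxed fetch_add, and no increment is lost. Doing the same with a plain or volatile int would be a data race and therefore undefined behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each perform
// increments_per_thread relaxed increments on one shared counter.
int count_with_relaxed_atomics(int num_threads, int increments_per_thread) {
    assert(num_threads >= 0 && increments_per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Relaxed ordering still guarantees the read-modify-write
                // is atomic; it only drops ordering guarantees relative
                // to other memory operations.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();  // join() synchronizes, so the final load sees every increment
    }
    return counter.load(std::memory_order_relaxed);
}
```

Calling count_with_relaxed_atomics(4, 10000) should return exactly 40000 on every run. Depending on the toolchain, linking may need a threading flag such as -pthread.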
Upvotes: 0