Roman

Reputation: 1513

atomic fetch_add vs add performance

The code below demonstrates some curiosities of multi-threaded programming, in particular the performance of a std::memory_order_relaxed increment versus a regular increment in a single thread. What I do not understand is why fetch_add(relaxed) in a single thread is twice as slow as a regular increment.

#include <atomic>
#include <benchmark/benchmark.h>

static void BM_IncrementCounterLocal(benchmark::State& state) {
  volatile std::atomic_int val2{0};  // value-initialize; default construction leaves it uninitialized pre-C++20

  while (state.KeepRunning()) {
    for (int i = 0; i < 10; ++i) {
      benchmark::DoNotOptimize(val2.fetch_add(1, std::memory_order_relaxed));
    }
  }
}
BENCHMARK(BM_IncrementCounterLocal)->ThreadRange(1, 8);

static void BM_IncrementCounterLocalInt(benchmark::State& state) {
  volatile int val3 = 0;

  while (state.KeepRunning()) {
    for (int i = 0; i < 10; ++i) {
      benchmark::DoNotOptimize(++val3);
    }
  }
}
BENCHMARK(BM_IncrementCounterLocalInt)->ThreadRange(1, 8);

Output:

      Benchmark                               Time(ns)    CPU(ns) Iterations
      ----------------------------------------------------------------------
      BM_IncrementCounterLocal/threads:1            59         60   11402509                                 
      BM_IncrementCounterLocal/threads:2            30         61   11284498                                 
      BM_IncrementCounterLocal/threads:4            19         62   11373100                                 
      BM_IncrementCounterLocal/threads:8            17         62   10491608

      BM_IncrementCounterLocalInt/threads:1         31         31   22592452                                 
      BM_IncrementCounterLocalInt/threads:2         15         31   22170842                                 
      BM_IncrementCounterLocalInt/threads:4          8         31   22214640                                 
      BM_IncrementCounterLocalInt/threads:8          9         31   21889704  

Upvotes: 7

Views: 4490

Answers (2)

With the volatile int, the compiler must ensure that it does not optimize away and/or reorder any reads/writes of the variable.

With the fetch_add, the CPU must take precautions that the read-modify-write operation is atomic.

These are two completely different requirements: The atomicity requirement means that the CPU has to communicate with other CPUs on your machine, ensuring that they don't read/write the given memory location between its own read and write. If the compiler compiles the fetch_add using a compare-and-swap instruction, it will actually emit a short loop to catch the case that some other CPU modified the value in between.
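
For illustration, here is a minimal sketch of how fetch_add could be expressed as such a compare-and-swap loop. The function name is made up for this example; it is not the library's actual implementation:

#include <atomic>

// One way a fetch_add can be built on compare-and-swap: if another CPU
// changes the value between our load and our CAS, the compare_exchange
// fails, reloads `expected`, and the loop retries.
int fetch_add_via_cas(std::atomic<int>& a, int arg) {
  int expected = a.load(std::memory_order_relaxed);
  while (!a.compare_exchange_weak(expected, expected + arg,
                                  std::memory_order_relaxed)) {
    // `expected` now holds the freshly observed value; try again.
  }
  return expected;  // like fetch_add, return the previous value
}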

For the volatile int, no such communication is necessary. On the contrary, volatile requires that the compiler does not invent any reads: volatile was designed for single-threaded communication with hardware registers, where the mere act of reading the value may have side effects.
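
A minimal sketch of that hardware-register use case (the address and register layout are invented for this illustration):

#include <cstdint>

// Hypothetical memory-mapped status register; the address is made up.
// Each read must really happen, because reading a device register can
// itself have side effects (e.g. clearing an interrupt flag).
volatile std::uint32_t* const status =
    reinterpret_cast<volatile std::uint32_t*>(0x40000000u);

void wait_until_ready() {
  // volatile forces a fresh load on every iteration; without it, the
  // compiler could legally read *status once and spin on a stale value.
  while ((*status & 0x1u) == 0) {
  }
}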

Upvotes: 2

The local version is not using atomics. (The fact that it is using volatile is a red herring - volatile has essentially no meaning in multi-threaded code.)

The atomic version is using atomics (!). The fact that only one thread is actually going to be accessing the variable is invisible to the CPU, and I'm not surprised the compiler hasn't spotted it either. (There's no point wasting developer effort figuring out if it is safe to turn std::atomic_int into int, when it almost never will be. Nobody will write atomic_int if they don't need to access it from multiple threads.)

As such, the atomic version will go to the trouble of making sure the increment actually is atomic, and frankly, I'm surprised that is only 2x slower - I would have expected more like 10x.
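
A small sketch of that difference; the comments describe what GCC/Clang typically emit on x86-64, though exact codegen depends on compiler and flags:

#include <atomic>

std::atomic<int> atomic_counter{0};
int plain_counter = 0;

// Typically compiles to a lock-prefixed instruction (e.g. `lock add` or
// `lock xadd`): the core must own the cache line exclusively for the whole
// read-modify-write, which costs time even with no contention at all.
void bump_atomic() {
  atomic_counter.fetch_add(1, std::memory_order_relaxed);
}

// Typically compiles to a plain `add` (or stays in a register entirely);
// there is no atomicity guarantee across cores.
void bump_plain() {
  ++plain_counter;
}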

Upvotes: 0
