George
George

Reputation: 4011

Is gcc-c++ not optimizing atomic operations for current x86-64 processors

Given the following test program:

#include <atomic>
#include <iostream>

int64_t process_one() {
        int64_t a;
        //Should be atomic on my haswell
        int64_t assign = 42;
        a = assign;
        return a;
}

int64_t process_two() {
        std::atomic<int64_t> a;
        int64_t assign = 42;
        a = assign;
        return a;
}

int main() {
        auto res_one = process_one();
        auto res_two = process_two();
        std::cout << res_one << std::endl;
        std::cout << res_two << std::endl;
}

Compiled with:

g++ --std=c++17 -O3 -march=native main.cpp

The code generated the following asm for the two functions:

00000000004007c0 <_Z11process_onev>:
  4007c0:       b8 2a 00 00 00          mov    $0x2a,%eax
  4007c5:       c3                      retq
  4007c6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007cd:       00 00 00

00000000004007d0 <_Z11process_twov>:
  4007d0:       48 c7 44 24 f8 2a 00    movq   $0x2a,-0x8(%rsp)
  4007d7:       00 00
  4007d9:       0f ae f0                mfence
  4007dc:       48 8b 44 24 f8          mov    -0x8(%rsp),%rax
  4007e1:       c3                      retq
  4007e2:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007e9:       00 00 00
  4007ec:       0f 1f 40 00             nopl   0x0(%rax)

Personally I don't speak much assembler but (and I might be mistaken here) it seems that process_two compiled to include all of process_one's and then some.

However, as far as I know, 'modern' x86-64 processors (e.g. Haswell, on which I compiled this) will do assignment atomically without the need for any extra operations (in this case I believe the extra operation is the mfence instruction in process_two).

So why wouldn't gcc just optimize the code in process two to behave exactly the case as process one ? Given the flags I compiled with.

Are there still cases where an atomic store behaves differently than an assignment to a normal variable given that they are both on 8 bytes.

Upvotes: 4

Views: 561

Answers (2)

BeeOnRope
BeeOnRope

Reputation: 64905

It's just a missed optimization. For example, clang does just fine with it - both functions compile identically as a single mov eax, 42.

Now, you'd have to dig into the gcc internals to be sure, but it seems to be that gcc has not yet implemented many common and legal optimizations around atomic variables, including merging consecutive reads and writes. In fact, none of clang, icc or gcc seem to optimize much of anything yet except that clang handles local atomics (including passed-by-value) by essentially removing their atomic nature, which is useful in some cases such as generic code. Sometimes icc seems to generate especially bad code - see two_reads here, for example: it seems to only ever want to use rax as the address and as the accumulator, resulting in a stream of mov instructions shuffling things around.

Some more complex issues around atomic optimization are discussed here and I expect compilers will get better at this over time.

Upvotes: 2

Marek Vitek
Marek Vitek

Reputation: 1633

The reason for it is that default use of std::atomic also implies memory order

std::memory_order order = std::memory_order_seq_cst

To achieve this consistency the compiler has to tell processor to not reorder instructions. And it does by using mfence instruction.

Change your

    a = assign;

to

    a.store(assign, std::memory_order_relaxed);

and your output will change from

process_two():
        mov     QWORD PTR [rsp-8], 42
        mfence
        mov     rax, QWORD PTR [rsp-8]
        ret

to

process_two():
        mov     QWORD PTR [rsp-8], 42
        mov     rax, QWORD PTR [rsp-8]
        ret

Just as you expected it to be.

Upvotes: 10

Related Questions