Is gcc-c++ not optimizing atomic operations for current x86-64 processors

Question

Given the following test program:

#include 
#include 

int64_t process_one() {
        int64_t a;
        //Should be atomic on my haswell
        int64_t assign = 42;
        a = assign;
        return a;
}

int64_t process_two() {
        std::atomic a;
        int64_t assign = 42;
        a = assign;
        return a;
}

int main() {
        auto res_one = process_one();
        auto res_two = process_two();
        std::cout << res_one << std::endl;
        std::cout << res_two << std::endl;
}

Compiled with:

g++ --std=c++17 -O3 -march=native main.cpp

The code generated the following asm for the two functions:

00000000004007c0 <_Z11process_onev>:
  4007c0:       b8 2a 00 00 00          mov    $0x2a,%eax
  4007c5:       c3                      retq
  4007c6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007cd:       00 00 00

00000000004007d0 <_Z11process_twov>:
  4007d0:       48 c7 44 24 f8 2a 00    movq   $0x2a,-0x8(%rsp)
  4007d7:       00 00
  4007d9:       0f ae f0                mfence
  4007dc:       48 8b 44 24 f8          mov    -0x8(%rsp),%rax
  4007e1:       c3                      retq
  4007e2:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007e9:       00 00 00
  4007ec:       0f 1f 40 00             nopl   0x0(%rax)

Personally I don't speak much assembler but (and I might be mistaken here) it seems that process_two compiled to include all of process_one's and then some.

However, as far as I know, 'modern' x86-64 processors (e.g. Haswell, on which I compiled this) will do assignment atomically without the need for any extra operations (in this case I believe the extra operation is the mfence instruction in process_two).

So why wouldn't gcc just optimize the code in process two to behave exactly the case as process one ? Given the flags I compiled with.

Are there still cases where an atomic store behaves differently than an assignment to a normal variable given that they are both on 8 bytes.

Marek Vitek · Accepted Answer

The reason for it is that default use of std::atomic also implies memory order

std::memory_order order = std::memory_order_seq_cst

To achieve this consistency the compiler has to tell processor to not reorder instructions. And it does by using mfence instruction.

Change your

    a = assign;

to

    a.store(assign, std::memory_order_relaxed);

and your output will change from

process_two():
        mov     QWORD PTR [rsp-8], 42
        mfence
        mov     rax, QWORD PTR [rsp-8]
        ret

to

process_two():
        mov     QWORD PTR [rsp-8], 42
        mov     rax, QWORD PTR [rsp-8]
        ret

Just as you expected it to be.

Is gcc-c++ not optimizing atomic operations for current x86-64 processors

Answers (2)

Related Questions