Reputation: 4011
Given the following test program:
#include <atomic>
#include <iostream>

int64_t process_one() {
    int64_t a;
    // Should be atomic on my Haswell
    int64_t assign = 42;
    a = assign;
    return a;
}

int64_t process_two() {
    std::atomic<int64_t> a;
    int64_t assign = 42;
    a = assign;
    return a;
}

int main() {
    auto res_one = process_one();
    auto res_two = process_two();
    std::cout << res_one << std::endl;
    std::cout << res_two << std::endl;
}
Compiled with:
g++ --std=c++17 -O3 -march=native main.cpp
The code generated the following asm for the two functions:
00000000004007c0 <_Z11process_onev>:
4007c0: b8 2a 00 00 00 mov $0x2a,%eax
4007c5: c3 retq
4007c6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4007cd: 00 00 00
00000000004007d0 <_Z11process_twov>:
4007d0: 48 c7 44 24 f8 2a 00 movq $0x2a,-0x8(%rsp)
4007d7: 00 00
4007d9: 0f ae f0 mfence
4007dc: 48 8b 44 24 f8 mov -0x8(%rsp),%rax
4007e1: c3 retq
4007e2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4007e9: 00 00 00
4007ec: 0f 1f 40 00 nopl 0x0(%rax)
Personally I don't speak much assembler, but (and I might be mistaken here) it seems that process_two compiled to include all of process_one's instructions and then some.
However, as far as I know, 'modern' x86-64 processors (e.g. Haswell, on which I compiled this) perform an aligned 8-byte store atomically without the need for any extra operations (in this case I believe the extra operation is the mfence instruction in process_two).
So why doesn't gcc just optimize the code in process_two to behave exactly like process_one, given the flags I compiled with?
Are there still cases where an atomic store behaves differently from an assignment to a normal variable, given that both are 8 bytes wide?
Upvotes: 4
Views: 561
Reputation: 64905
It's just a missed optimization. Clang, for example, does just fine with it: both functions compile identically, to a single mov eax, 42.
Now, you'd have to dig into the gcc internals to be sure, but it seems that gcc has not yet implemented many common and legal optimizations around atomic variables, including merging consecutive reads and writes. In fact, none of clang, icc, or gcc seem to optimize much of anything yet, except that clang handles local atomics (including those passed by value) by essentially removing their atomic nature, which is useful in some cases such as generic code. Sometimes icc seems to generate especially bad code - see two_reads here, for example: it seems to only ever want to use rax both as the address and as the accumulator, resulting in a stream of mov instructions shuffling things around.
Some more complex issues around atomic optimization are discussed here, and I expect compilers will get better at this over time.
Upvotes: 2
Reputation: 1633
The reason is that the default store to a std::atomic uses the strongest memory ordering:
std::memory_order order = std::memory_order_seq_cst
To achieve this sequential consistency, the compiler has to tell the processor not to reorder the store with later loads, and on x86 it does so by emitting the mfence instruction.
Change your
a = assign;
to
a.store(assign, std::memory_order_relaxed);
and your output will change from
process_two():
mov QWORD PTR [rsp-8], 42
mfence
mov rax, QWORD PTR [rsp-8]
ret
to
process_two():
mov QWORD PTR [rsp-8], 42
mov rax, QWORD PTR [rsp-8]
ret
Just as you expected it to be.
Upvotes: 10