Reputation: 924
Consider the following piece of code:
volatile int a;
volatile int b;
int x;
void func() {
    a = 1;
    x = 0; /* dummy statement */
    b = 2;
}
In this code snippet, there is a sequence point at the end of the assignment to x. Hence, according to the C90 standard, the access to the volatile variable a must be finished before the access to b is started. When translating this piece of code to x86-64 assembly, the body of the function is translated as follows:
movl $1, a(%rip)
movl $0, x(%rip)
movl $2, b(%rip)
Now, when executing this code, the CPU may reorder the memory accesses, thus breaking the requirement of the C standard that the accesses to a and b are performed in order. So, isn't this translation incorrect, and wouldn't the compiler have to insert memory barriers to enforce the ordering?
Edit: Consider the case where a and b are variables shared by two threads. In this case, a synchronization protocol between the two threads may rely on the fact that accesses to a and b occur in order. Thus, when the CPU reorders the accesses, this may break that protocol (I'm not actually trying to implement such a protocol, I'm just wondering what the correct interpretation of the C standard is).
Upvotes: 5
Views: 318
Reputation: 365747
volatile only requires that the assembly sequencing (program order) of volatile operations matches the source / abstract machine. (So no compile-time reordering wrt. other volatile accesses¹, and no optimizing away of accesses.)

If you need runtime ordering of memory visibility from outside this thread stronger than what the hardware guarantees, you need inline asm for memory barriers, or C11 <stdatomic.h>.
See When to use volatile with multi threading? - never; C11 and C++11 made previous such usage obsolete. But for small-enough types it's somewhat like _Atomic with memory_order_relaxed.
(In terms of pure ISO C11, data races on volatile are undefined behaviour. It's up to real implementations to say what happens, but all real-world implementations only run threads of the same program across cores with cache-coherent shared memory, so you do at least get visibility.)
MSVC can optionally give volatile in C/C++ semantics like x86's hardware memory model (memory_order_acq_rel) even on other ISAs, and not do compile-time reordering even wrt. non-volatile accesses. This is where C#'s volatile semantics came from, IIRC. (Their docs now "strongly recommend" /volatile:iso and using standard C++ synchronization stuff, not volatile with /volatile:ms.) None of the other mainstream C and C++ compilers go beyond the ISO spec for ordering of other ops wrt. volatile.
The existence of C11 stdatomic.h and _Atomic makes MS's volatile shenanigans basically obsolete; just use standard portable atomic_store_explicit(&my_atomic_var, 1, memory_order_release);. It's a lot more typing, especially in C vs. C++ my_atomic_var.store(1, std::memory_order_release), but it's portable and well-defined and doesn't need compiler options to work right.
For MMIO purposes, uncacheable memory regions are even more strongly ordered on x86 than normal (cacheable) memory regions. So dev->control_reg = 123; tmp = dev->data_reg; should just work.
If there are any ISAs that aren't like that, you might need memory barrier instructions between volatile accesses to control the order in which devices see your stores and loads.
MMIO, and defeating the optimizer for microbenchmarking or debugging purposes, are the main use-cases for volatile
.
"Instruction reordering" is a sloppy way to describe things. add edx, eax ; add eax, ecx has an anti-dependency (write-after-read) hazard on EAX, so statically reordering the asm source would give wrong results. Register renaming avoids this problem, still allowing out-of-order execution of the operations but keeping track of which version of EAX the first add reads.
Memory reordering is separate from out-of-order execution; it can happen even on CPUs that begin execution of instructions in program order. e.g. ones with hit-under-miss and miss-under-miss caches will scoreboard loads and only stall when a later instruction tries to use a load result that still isn't ready. And store buffers are very common, introducing StoreLoad reordering (which even x86 allows).
(You got it right when you said "may reorder the memory accesses" rather than "the instructions" in the question body.)
The cardinal rule of out-of-order exec is: "don't break single-threaded code". It's like C's as-if rule for optimization, where in both cases the visible things that have to be preserved don't include the order of our memory operations as seen from outside (loads reading from cache and stores becoming visible). So when using volatile, that's just up to the hardware memory model. (volatile can't reorder at compile time with other volatile loads/stores, but can with other operations.)
That's why the Linux kernel has CPP macros defined for each target it supports, with inline asm like GNU C asm("lwsync" ::: "memory") (PowerPC, for smp_rmb() or smp_wmb()), or asm("" ::: "memory") for barrier() (or any x86 barrier other than a full barrier).
to x86-64 assembler, [...] the CPU may reorder the memory accesses
Actually no, x86-64's TSO (Total Store Order) memory model is program-order plus a store buffer with store forwarding. So it has to commit stores from the store-buffer to L1d cache in program order. Only compile-time reordering could change the visibility order of those three stores on x86-64. But that's just a poor choice of example; most other ISAs in widespread use are not that strongly ordered.
Footnote 1: Compile-time reordering of other operations around volatile accesses is allowed and does happen, except on MSVC with /volatile:ms. plain = 1; vol = 2; plain = 3; will get dead-store elimination, doing just the plain = 3 assignment before or after the volatile store. (If it's not optimized into a register or away entirely.)
For example, adding more assignments to x, GCC still only does one store to it (Godbolt compiler explorer):
volatile int a;
volatile int b;
int x;
void func() {
    x = -1; // non-volatile
    a = 1;
    x = 0;  // your original non-volatile
    b = 2;
    x = -2; // A third assignment to the same plain var
}
# GCC14 -O3
func:
        movl    $1, a(%rip)
        movl    $-2, x(%rip)    # x = -2 between the two volatiles
        movl    $2, b(%rip)
        ret
Upvotes: 4
Reputation: 182734
CPUs might reorder instructions but they have to make sure the outcome is the same as if they hadn't.
Upvotes: 6