Hermann Speiche

Reputation: 924

Do C90-compliant compilers have to take into account instruction reordering by the CPU?

Consider the following piece of code:

volatile int a;
volatile int b;
int x;

void func() {
    a = 1;
    x = 0; /* dummy statement */
    b = 2;
}

In this code snippet, the statement assigning to x ends with a sequence point. Hence, according to the C90 standard, the access to the volatile variable a must be finished before the access to b is started. When this code is compiled for x86-64, the body of the function is translated as follows:

movl $1, a(%rip)
movl $0, x(%rip)
movl $2, b(%rip)

Now, when executing this code, the CPU may reorder the memory accesses, thus breaking the requirement of the C standard that the accesses to a and b are performed in order. So, isn't this translation incorrect, and wouldn't the compiler have to insert memory barriers to enforce the ordering?

Edit: Consider the case where a and b are variables shared by two threads. In this case, a synchronization protocol between the two threads may rely on the fact that accesses to a and b occur in order. Thus, when the CPU reorders the accesses, this may break that protocol (I'm not actually trying to implement such a protocol, I'm just wondering what the correct interpretation of the C standard is).
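
For illustration only (this is not code I'm actually writing), such a protocol might look like the following, where the reader assumes that seeing b == 2 implies the store to a is already visible:

#include <pthread.h>
#include <stdio.h>

volatile int a;
volatile int b;

/* Writer: publishes a, then signals via b. */
static void *writer(void *arg) {
    (void)arg;
    a = 1;
    b = 2;   /* the reader assumes this becomes visible only after a = 1 */
    return NULL;
}

/* Reader: spins until it sees b == 2, then reads a. */
static void *reader(void *arg) {
    (void)arg;
    while (b != 2)
        ;    /* busy-wait */
    printf("a = %d\n", a);   /* must this print 1? */
    return NULL;
}

int main(void) {
    pthread_t tw, tr;
    pthread_create(&tr, NULL, reader, NULL);
    pthread_create(&tw, NULL, writer, NULL);
    pthread_join(tw, NULL);
    pthread_join(tr, NULL);
    return 0;
}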

Upvotes: 5

Views: 318

Answers (2)

Peter Cordes

Reputation: 365747

volatile only requires that assembly sequencing (program order) of volatile operations matches the source / abstract machine. (So no compile-time reordering wrt. other volatile accesses¹, and no optimizing away of accesses.)

If you need runtime ordering of memory-visibility from outside this thread stronger than what the hardware guarantees, you need inline asm for memory barriers, or C11 <stdatomic.h>.

See When to use volatile with multi threading? - never; C11 and C++11 made previous such usage obsolete. But for small-enough types it's somewhat like _Atomic with memory_order_relaxed.
(In terms of pure ISO C11, data races on volatile are undefined behaviour. It's up to real implementations to say what happens, but all real-world implementations only run threads of the same program across cores with cache-coherent shared memory, so you do at least get visibility.)

MSVC can optionally give volatile in C/C++ semantics like x86's hardware memory model (memory_order_acq_rel) even on other ISAs, and not doing compile-time reordering even wrt. non-volatile accesses. This is where C#'s volatile semantics came from, IIRC. (Their docs now "strongly recommend" /volatile:iso and using standard C++ synchronization stuff, not volatile with /volatile:ms). None of the other mainstream C and C++ compilers go beyond the ISO spec for ordering of other ops wrt. volatile.

The existence of C11 stdatomic.h and _Atomic makes MS's volatile shenanigans basically obsolete; just use standard portable atomic_store_explicit(&my_atomic_var, 1, memory_order_release); It's a lot more typing, especially in C vs. C++ my_atomic_var.store(1, std::memory_order_release), but it's portable and well-defined and doesn't need compiler options to work right.
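
As a sketch (reusing the variables from the question, with a and b made _Atomic instead of volatile), the function could be written like this:

#include <stdatomic.h>

atomic_int a;
atomic_int b;
int x;

void func(void) {
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    x = 0;             /* plain non-atomic store */
    /* release store: another thread that acquire-loads b and sees 2 is
       guaranteed to also see a == 1 (and the x = 0 store). */
    atomic_store_explicit(&b, 2, memory_order_release);
}

On x86-64 this still compiles to plain mov stores; on weakly-ordered ISAs the compiler emits whatever barrier or release-store instruction the ordering requires.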


Device drivers and memory-mapped I/O

For MMIO purposes, uncacheable memory regions are even more strongly ordered on x86 than normal (cacheable) memory regions. So dev->control_reg = 123; tmp = dev->data_reg; should just work.
If there are any ISAs that aren't like that, you might need memory barrier instructions between volatile accesses to control the order in which devices see your stores and loads.

MMIO, and defeating the optimizer for microbenchmarking or debugging purposes, are the main use-cases for volatile.
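
For example, a typical shape for such MMIO code looks like this (the register layout and base address are invented for illustration):

#include <stdint.h>

/* Hypothetical device registers; the layout and address are made up. */
struct dev_regs {
    volatile uint32_t control_reg;
    volatile uint32_t data_reg;
};

#define DEV ((struct dev_regs *)0xFE200000u)   /* placeholder MMIO base address */

uint32_t poke_device(void) {
    DEV->control_reg = 123;   /* volatile store: not optimized away, not reordered
                                 wrt. other volatile accesses at compile time */
    return DEV->data_reg;     /* volatile load, issued after the store above */
}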


Terminology: "instruction reordering" is misleading

"Instruction reodering" is a sloppy way to describe things. add edx, eax ; add eax, ecx has an anti-dependency (write-after-read) hazard on EAX, so statically reordering the asm source would give wrong results. Register renaming avoids this problem, still allowing out-of-order execution of the operations but keeping track of which version of EAX the first add reads.

Memory reordering is separate from out-of-order execution; it can happen even on CPUs that begin execution of instructions in program order. e.g. ones with hit-under-miss and miss-under-miss caches will scoreboard loads and only stall when a later instruction tries to use a load result that still isn't ready. And store buffers are very common, introducing StoreLoad reordering (which even x86 allows).

(You got it right when you said "may reorder the memory accesses" rather than "the instructions" in the question body.)

The cardinal rule of out-of-order exec is: "don't break single-threaded code". It's like C's as-if rule for optimization, where in both cases the visible things that have to be preserved don't include the order of our memory operations seen from outside (loads reading from cache and stores becoming visible). So when using volatile, that's just up to the hardware memory model. (volatile can't reorder at compile time with other volatile loads/stores, but can with other operations.)

That's why the Linux kernel has CPP macros defined for each target it supports, with inline asm like GNU C asm("lwsync" ::: "memory") (PowerPC, for smp_rmb() or smp_wmb()), or asm("" ::: "memory") for barrier() (and for any x86 barrier other than a full barrier).
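
A stripped-down sketch of that approach in GNU C (simplified for illustration; not the kernel's actual definitions):

/* Compiler barrier: blocks compile-time reordering of memory accesses
   across it, but emits no instruction. */
#define barrier()  __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__)
/* x86-64 is strongly ordered (TSO): only StoreLoad needs a real barrier. */
#define smp_mb()   __asm__ __volatile__("mfence" ::: "memory")
#define smp_rmb()  barrier()
#define smp_wmb()  barrier()
#elif defined(__powerpc64__)
#define smp_mb()   __asm__ __volatile__("sync"   ::: "memory")
#define smp_rmb()  __asm__ __volatile__("lwsync" ::: "memory")
#define smp_wmb()  __asm__ __volatile__("lwsync" ::: "memory")
#endif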


x86's memory model

to x86-64 assembler, [...] the CPU may reorder the memory accesses

Actually no, x86-64's TSO (Total Store Order) memory model is program-order plus a store buffer with store forwarding. So it has to commit stores from the store-buffer to L1d cache in program order. Only compile-time reordering could change the visibility order of those three stores on x86-64. But that's just a poor choice of example; most other ISAs in widespread use are not that strongly ordered.


Compile-time reordering of other ops around volatiles

Footnote 1:
Compile-time reordering of other operations around volatile is allowed and does happen, except on MSVC with /volatile:ms. Given plain = 1; vol = 2; plain = 3;, compilers will do dead-store elimination and emit just the plain = 3 assignment, either before or after the volatile store. (If it's not optimized into a register or away entirely.)

For example, after adding more assignments to x, GCC still does only one store to it (Godbolt compiler explorer):

volatile int a;
volatile int b;
int x;

void func() {
    x = -1;     // non-volatile
    a = 1;
    x = 0;     // your original non-volatile
    b = 2;
    x = -2;     // A third assignment to the same plain var
}
# GCC14 -O3
func:
        movl    $1, a(%rip)
        movl    $-2, x(%rip)   # x=-2  between the two volatiles
        movl    $2, b(%rip)
        ret

Upvotes: 4

cnicutar

Reputation: 182734

CPUs might reorder instructions, but they have to make sure the outcome is the same as if they hadn't.

Upvotes: 6
