Curious

Reputation: 21510

Possible orderings with memory_order_seq_cst and memory_order_release

With reference to the following code

#include <atomic>
#include <cstdint>

auto x = std::atomic<std::uint64_t>{0};
auto y = std::atomic<std::uint64_t>{0};

// thread 1
x.store(1, std::memory_order_release);
auto one = y.load(std::memory_order_seq_cst);

// thread 2
y.fetch_add(1, std::memory_order_seq_cst);
auto two = x.load(std::memory_order_seq_cst);

Is it possible here, that one and two are both 0?


(I seem to be hitting a bug that could be explained if one and two could both hold the value of 0 after the code above runs. And the rules for ordering are too complicated for me to figure out what orderings are possible above.)

Upvotes: 2

Views: 1080

Answers (1)

Peter Cordes

Reputation: 364220

Yes, it's possible for both loads to get 0.

Within thread 1, y.load can "pass" x.store(mo_release) because the two operations aren't both seq_cst. The single total order of seq_cst operations that ISO C++ guarantees must exist covers only seq_cst operations, so the release store doesn't participate in it.
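
As a contrast, here's a sketch (not from the question) of the variant where thread 1's store is also seq_cst, so all four operations participate in the single total order and the 0, 0 outcome becomes impossible:

x.store(1, std::memory_order_seq_cst);         // thread 1: now part of the SC total order
auto one = y.load(std::memory_order_seq_cst);  // thread 1

y.fetch_add(1, std::memory_order_seq_cst);     // thread 2
auto two = x.load(std::memory_order_seq_cst);  // thread 2

// one == 0 would put y.load before y.fetch_add in the total order, and
// two == 0 would put x.load before x.store.  Together with each thread's
// program order that forms a cycle, so at least one load must return non-zero.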

(In terms of hardware / CPU architecture, on a normal CPU the load can take its value from coherent cache before the release store has drained from the store buffer. For this case I found it much easier to reason in terms of how I know the code compiles for x86 (or to generic acquire and release operations), and then apply the asm memory-ordering rules. That shortcut assumes the normal C++-to-asm mappings are sound, i.e. always at least as strong as the C++ memory model. If you can find a legal reordering this way, you don't need to wade through the C++ formalism; but if you can't, that of course doesn't prove the code is safe in the C++ abstract machine.)

Anyway, the key point to realize is that a seq_cst operation isn't like atomic_thread_fence(mo_seq_cst): individual seq_cst operations only have to recover/maintain sequential consistency in how they interact with other seq_cst operations, not with plain acquire/release/acq_rel operations. (Similarly, acquire and release fences are stronger two-way barriers, unlike acquire and release operations, as Jeff Preshing explains.)
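
To illustrate that difference, a minimal sketch (a variant of thread 1, not the original code): keeping the release store but putting an actual seq_cst fence after it would forbid the 0, 0 result, because the fence itself participates in the SC total order, and a seq_cst load that follows it in that order must see writes sequenced before the fence.

x.store(1, std::memory_order_release);                // still just a release store
std::atomic_thread_fence(std::memory_order_seq_cst);  // full 2-way barrier, in the SC total order
auto one = y.load(std::memory_order_seq_cst);         // can no longer "pass" the store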


The reordering that makes this happen

I renamed one and two to r1 and r2 (local "registers" within each thread), to avoid writing things like one == 0.

The execution below is the only one that involves an actual reordering; every other possibility is just an interleaving of the two threads' program order. Having the store "happen" (become globally visible) last is what produces the r1 == 0, r2 == 0 result.

// T1's x.store(1) comes first in program order, but it doesn't have to drain from the store buffer before later loads
auto r1 = y.load(std::memory_order_seq_cst);   // T1b             r1 = 0 (y)
         y.fetch_add(1, std::memory_order_seq_cst);      // T2a   y = 1 becomes globally visible
         auto r2 = x.load(std::memory_order_seq_cst);    // T2b   r2 = 0 (x)
x.store(1, std::memory_order_release);         // T1a             x = 1 eventually becomes globally visible
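
If you want to hunt for this in practice, here's a rough litmus-test harness (a sketch: the go flag and iteration count are arbitrary, and spawning fresh threads each iteration keeps the race window small; dedicated tools like litmus7 reuse threads and are far better at provoking these outcomes):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<std::uint64_t> x{0}, y{0};
std::atomic<bool> go{false};
std::uint64_t r1, r2;          // written before join(), read after, so no race

int main() {
    for (long i = 0; i < 1000000; ++i) {
        x.store(0); y.store(0); go.store(false);
        std::thread t1([] {
            while (!go.load(std::memory_order_acquire)) {}   // spin until start
            x.store(1, std::memory_order_release);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([] {
            while (!go.load(std::memory_order_acquire)) {}
            y.fetch_add(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        go.store(true, std::memory_order_release);           // release both threads
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0)
            std::printf("r1 == 0 && r2 == 0 on iteration %ld\n", i);
    }
}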

This can happen in practice on x86, but interestingly not on AArch64. On x86, a release store needs no extra barriers (it's just a normal store), and a seq_cst load compiles the same as a plain acquire load: also just a normal load.
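
The usual x86-64 mapping (worth checking against your compiler's actual output; this is the typical code-gen) for thread 1 looks like:

x.store(1, std::memory_order_release);         // mov qword ptr [x], 1    (plain store)
auto r1 = y.load(std::memory_order_seq_cst);   // mov rax, qword ptr [y]  (plain load)
// No mfence or xchg between them, so the store can still be sitting in the
// store buffer when the load reads y from cache: StoreLoad reordering.

Only a seq_cst store pays for a full barrier on x86 (compilers use xchg, or mov + mfence), which is why upgrading the store would close this gap.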

On AArch64, release and seq_cst stores use STLR. seq_cst loads use LDAR, which has a special interaction with STLR, not being allowed to read cache until the last STLR drains from the store buffer. So release-store / seq_cst load on ARMv8 is the same as seq_cst store / seq_cst load. (ARMv8.3 added LDAPR, allowing true acquire / release by letting acquire loads compile differently; see this Q&A.)
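
The corresponding typical AArch64 code-gen (register allocation here is illustrative):

x.store(1, std::memory_order_release);         // stlr x1, [x0]   ; x0 = &x
auto r1 = y.load(std::memory_order_seq_cst);   // ldar x2, [x3]   ; x3 = &y
// LDAR can't take its value from cache while an earlier STLR is still in the
// store buffer, so this pair can't reorder and the 0, 0 result can't happen.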

However, it can also happen on many ISAs that use separate barrier instructions, like ARM32: a release store is typically a barrier followed by a plain store, which prevents reordering with earlier loads/stores but doesn't stop the store from reordering with later ones. If the seq_cst load doesn't need a full barrier before itself (which is the normal case), the store can reorder after the load.

For example, a release store on ARMv7 is dmb ish; str, and a seq_cst load is ldr; dmb ish, so you have str / ldr with no barrier between them.

On PowerPC, a seq_cst load is hwsync; ld; cmp; bc; isync, so there's a full barrier before the load. (The HeavyWeight Sync is, I think, part of preventing IRIW reordering: it blocks store-forwarding between SMT threads on the same physical core, so stores from other cores are only seen once they actually become globally visible.)

Upvotes: 6
