WaltK

Reputation: 774

Can the compiler optimize out accesses with memory order relaxed that are not ordered by any memory fence?

Consider the following code:

#include <atomic>

std::atomic<bool> flag;

void thread1()
{
    flag.store(true, std::memory_order_relaxed);
}

void thread2()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
}

Under the Standard, could the compiler optimize out the store in thread1 (since thread1 has no release fence), making thread2 an infinite loop? Or could the compiler keep flag in a register in thread2 after reading it from memory once (making thread2 a potential infinite loop regardless of how flag is written)?

If that's a potential problem, would volatile fix it?

volatile std::atomic<bool> flag;

An answer that quotes from the Standard would be most appreciated.

Upvotes: 4

Views: 307

Answers (1)

Peter Cordes

Reputation: 365557

No, skipping the store or hoisting the std::atomic load out of the loop (and inventing an infinite loop that the thread could never leave!) would violate the standard's guidance that stores "should" become visible to loads "in a reasonable amount of time" [atomics.order] and within a "finite period of time" [intro.progress].

I suspect these are worded as "should" rather than "must" because context switches and extreme load can suspend another thread for a long time in the worst case (swap thrashing, or even using a debugger to pause and single-step one of the program's threads), and to allow for things like cooperative multi-tasking on a single core, where the time between context switches might sometimes be high-ish.

Those aren't just notes; they are normative. One might argue that a Deathstation 9000 could totally ignore those "should" recommendations, but without some justification it seems unreasonable. There are lots of ways to make an ISO C++ implementation that's nearly unusable, and any implementation that aims to be usable will definitely compile that .store(true, relaxed) to an asm store, and the load to a load inside the loop.

Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`? asks a slightly different question about an equivalent exit_now spin-loop (worried more about keeping inter-thread latency low than about the loop being infinite), but the same quotes from the standard apply.
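
For concreteness, here's a minimal sketch of that kind of spin-loop (the exit_now name and the relaxed store are my choices, not taken from either question):

#include <atomic>
#include <thread>

std::atomic<bool> exit_now{false};  // hypothetical stop flag

void worker()
{
    while (!exit_now.load(std::memory_order_relaxed))
    {
        // do some work; the relaxed load must still re-read the flag
        // every iteration rather than being hoisted out of the loop
    }
}

int main()
{
    std::thread t(worker);
    exit_now.store(true, std::memory_order_relaxed);  // worker exits soon after
    t.join();
}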

CPUs commit stores to cache-coherent memory as soon as they can; stronger orders (and fences) just make this thread wait for things, e.g. for an acquire load to complete before later loads take their values, or for earlier stores and loads to complete before a release store commits to L1d cache and itself becomes globally visible. Fences don't push data to other threads; they just control the order in which that happens. Data becoming globally visible happens on its own, very fast. (If you were implementing C++ on hypothetical hardware that didn't work this way, you'd have to compile even a relaxed store to include extra instructions to flush the cache line it just wrote.)
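
To make the ordering-vs-visibility distinction concrete, here's a sketch (the payload / ready names are hypothetical): the release/acquire pair doesn't make ready visible to the other thread any sooner; it only guarantees that a thread which sees ready == true also sees the earlier write to payload.

#include <atomic>

int payload;                     // plain non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // ordered before the store below
    ready.store(true, std::memory_order_release);  // becomes visible on its own, quickly
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))  // once this observes true...
        ;
    int x = payload;  // ...the acquire/release pairing guarantees this sees 42
    (void)x;
}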

IDK if this misconception about barriers being needed to create visibility (rather than to order it) is what's happening here, or if you're just asking what actual wording in the standard prevents a Deathstation 9000 from being terrible.


The store definitely can't be optimized away, for that and other reasons: it's a visible side effect that changes the program state, and it's guaranteed visible to later loads in the same thread (e.g. in the caller of the thread1 function). For the same reason, even a non-atomic plain_bool = true assignment couldn't be optimized away, unless it inlined into a caller that did plain_bool = false afterwards; then dead-store elimination could happen, as in the sketch below.
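
A sketch of that non-atomic case, with a hypothetical caller to show where dead-store elimination becomes legal:

bool plain_bool;

void set_it()
{
    plain_bool = true;   // can't be removed in isolation: later code in this
}                        // thread is allowed to read it back

void caller()
{
    set_it();            // after inlining, plain_bool = true is overwritten
    plain_bool = false;  // with no intervening read, so it's a dead store
}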


Compilers currently don't optimize atomics, treating them basically like volatile atomic<> already, but ISO C++ would allow optimization of atomic flag=true; flag=false; into just flag=false; (even with seq_cst, but also with .store(val, relaxed)). This could remove the time window for other threads to ever detect that the variable was true; ISO C++ makes no guarantee that any state which exists in the abstract machine can actually be observed by another thread.
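
A sketch of the kind of merging the standard would permit (to be clear, no mainstream compiler currently does this):

#include <atomic>

std::atomic<bool> flag;

void blink()
{
    flag = true;   // the abstract machine stores twice, but the as-if rule
    flag = false;  // would allow collapsing this to just flag = false;
}                  // leaving no window in which another thread sees true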

However, as a quality-of-implementation issue, it can be undesirable to optimize away an unlock/relock or a ++ / -- pair, which is part of why compilers don't optimize atomics. Also related: If a RMW operation changes nothing, can it be optimized away, for all memory orders? - merging two RMWs into a no-op can't optimize away their memory-ordering semantics, unless they're both relaxed and there are no fences anywhere, including in possible callers.
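
For example, a pair of relaxed RMWs that cancel out (again a sketch; with no fences anywhere, the standard would permit the pair to disappear, though compilers today keep both):

#include <atomic>

std::atomic<int> counter;

void touch()
{
    counter.fetch_add(1, std::memory_order_relaxed);  // both relaxed, no fences:
    counter.fetch_sub(1, std::memory_order_relaxed);  // in principle this pair could
}                                                     // merge into a no-op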


Even if compilers did optimize as much as the standard allows under the as-if rule, you still wouldn't need volatile atomic for this case (assuming the caller of thread1() doesn't do flag.store(false, order) right after the call).

You might perhaps want volatile atomic in other situations, though. But http://wg21.link/p0062 / http://wg21.link/n4455 point out that even volatile atomic doesn't close all the possible loopholes for overly aggressive optimization, so until further design progress is made on letting programmers control when optimization of atomics is OK, the plan is that compilers will continue to behave as they do now, not optimizing atomics.


Also related re: compiler optimizations inventing infinite loops

Upvotes: 4
