Reputation: 1772
The atomic_compare_exchange_strong_explicit()
function takes two memory_order
parameters, success
and failure
(as does atomic_compare_exchange_weak_explicit()
). Unpicking the C11/C18 Standards, I find that the allowed values for success
and failure
are:
success = memory_order_relaxed failure = memory_order_relaxed
success = _release failure = _relaxed
success = _consume failure = _relaxed
or _consume
success = _acquire failure = _relaxed
or _acq_rel or _consume (?)
or _acquire
success = _seq_cst failure = _relaxed
or _consume (?)
or _acquire
or _seq_cst
The Standard also says:
Further, if the comparison is true, memory is affected according to the value of success, and if the comparison is false, memory is affected according to the value of failure. These operations are atomic read-modify-write operations (5.1.2.4).
Your ARM, POWER-PC and other "LL/SC" devices do a Load-Link/Cmp/Store-Conditional sequence to implement atomic-cmp-exchange, where the Load-Link may or may not be _acquire and the Store-Conditional may or may not be _release.
So I can understand: success = _acq_rel and failure = _acquire.
What I cannot get my head around is (amongst others): success = _acq_rel and failure = _relaxed. Surely, in order to achieve _acq_rel, the Load-Link must be _acquire ? If the cmp fails, surely it is then too late to downgrade to _relaxed ?
What is the intended meaning of the combinations of success
and failure
parameters (when they are not equal) ?
[It's always possible that I have been bamboozled by the Standard, and that the failure
memory-order may, in fact, only be the read half of whatever the success
memory-order is.]
Upvotes: 3
Views: 743
Reputation: 365247
One way of implementing an acquire load in asm on some ISAs is a plain load followed by a fence, e.g. on ISAs like PowerPC or ARM before ARMv8 introduced ldar
/ ldaxr
. This later fence can be skipped if the failure ordering doesn't include acquire.
An LL/SC CAS_weak might look something like this in pseudo-asm for no real ISA:
ll r0, mem
cmp r0, r1
jne .fail # early-out compare fail
sc r2, mem # let's pretend this sets CF condition code on SC failure
jc .fail # jump if SC failed
lwsync # LoadLoad, StoreStore, and LoadStore but not StoreLoad
... CAS success path
.fail: # we jump here without having executed any barriers
... CAS failure path
This (I think) could be a valid implementation of mem.compare_exchange_weak(r1, r2, mo_acquire, mo_relaxed);
on some kinds of machines.
It's only an acquire operation so the whole RMW can reorder with earlier operations (the nature of LL/SC keeps them stuck together in the global order). There's no barrier ahead of the operation, only after.
lwsync
is a PowerPC barrier instruction which blocks all reordering except StoreLoad (i.e. doesn't flush the store buffer). https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/
To implement CAS(..., acq_rel, relaxed)
on most ISAs, we'd also run a barrier before the LL/SC (to separate it from earlier operations, creating the release part). That would execute even on the failure path, but would not create acquire semantics. It wouldn't separate the load from later operations.
AFAIK you wouldn't want to put the fence between LL and SC to maybe skip if the compare failed. That would lengthen the transaction and give more chance for it to fail from activity of other threads.
As always when implementing the C++ memory model on top of real asm, you do something that's as strong as necessary but no stronger, given the limits of what primitives the underlying ISA provides. Most ISAs won't make it possible for the CAS(acq_rel, relaxed) failure path to actually be as cheap as a plain relaxed load, either because it's impossible or because it would hurt the normal-case performance. But on some it can still be less expensive than if the failure side needed to have acquire semantics.
Some ISAs (like ARM) apparently only have full barriers (dsb ish
) so even an acq_rel ends up draining the store buffer. (So ARMv8 introducing acquire-load and sequential-release stores was very nice, since it perfectly matches C++ seq-cst semantics and can potentially be much cheaper than a barrier.)
Upvotes: 5