Reputation: 11402
With ARMv8.3 a new instruction has been introduced: LDAPR.
When there is a STLR followed by a LDAR to a different address, then these 2 can't be reordered and hence it is called RCsc (release consistent sequential consistent).
When there is a STLR followed by a LDAPR to a different address, then these 2 can be reordered. This is called RCpc (release consistent processor consistent).
My issue is with the PC part.
PC is a relaxation of TSO whereby TSO is multi-copy atomic and PC is non multi-copy atomic.
The memory model of ARMv8 has been improved to be multi-copy atomic because no supplier ever created a non multi-copy atomic microarchitecture and it made the memory model more complicated.
So I'm running into a contradiction.
The key question is: is every store (including relaxed) multi-copy atomic?
If so, then the PC part of rcpc doesn't make sense to me since PC is non multi-copy atomic. Could it be a legacy name due to ARM being non multi-copy atomic in the past?
There are multiple definitions of PC; so perhaps that is the cause.
Upvotes: 7
Views: 1785
Reputation: 365707
In practice, STLR / LDAPR gives C++ std::memory_order_release and acquire, as opposed to seq_cst from STLR / LDAR.
LDAR can't reorder with an earlier STLR, but LDAPR can.
LDAPR allows StoreLoad reordering even with earlier release and seq_cst (STLR) stores, whereas LDAR allows it only with earlier relaxed and non-atomic stores (STR), to the same or different addresses. Waiting for the store buffer to drain is slow; that's the major thing that makes release / acquire faster than code with seq_cst stores and loads (or with acquire loads before ARMv8.3, which had to use LDAR).
So "processor consistent" is presumably describing the fact that the current core sees its own operations in program order, and as a way to note that it's not sequentially consistent because they don't use that term. It doesn't mean that other parts of the memory model rules are removed.
Yes, ARMv8 is multi-copy atomic, so every plain store (str, stp, etc.) is multi-copy atomic, i.e. it becomes visible to all other cores at the same time via coherent cache, so all threads can agree on the order of two stores done by two independent writers (the IRIW litmus test). This is unlike POWER, where some threads can see stores early from other SMT threads on the same physical core.
LDAPR doesn't relax that guarantee.
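For reference, the IRIW (independent reads of independent writes) shape looks roughly like the sketch below; iriw_disagreements is a name I made up. I've used seq_cst throughout because that's the only ordering for which ISO C++ itself forbids the readers disagreeing. With acquire loads the C++ abstract machine allows it on paper; the point above is that ARMv8 hardware won't actually produce it, with either LDAR or LDAPR.

```cpp
#include <atomic>
#include <thread>

// IRIW litmus test: two independent writers, and two readers that read
// x and y in opposite orders. Returns how many runs had the readers
// disagree about which store happened first.
int iriw_disagreements(int iterations) {
    int disagreements = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int a = 0, b = 0, c = 0, d = 0;
        std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
        std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
        std::thread r1([&] {
            a = x.load(std::memory_order_seq_cst);
            b = y.load(std::memory_order_seq_cst);
        });
        std::thread r2([&] {
            c = y.load(std::memory_order_seq_cst);
            d = x.load(std::memory_order_seq_cst);
        });
        w1.join(); w2.join(); r1.join(); r2.join();
        // r1 saw x's store but not y's; r2 saw y's store but not x's.
        if (a == 1 && b == 0 && c == 1 && d == 0) ++disagreements;
    }
    return disagreements;
}
```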
(ARMv7 did not have this property, and I've heard that some of NVidia's 32-bit ARM designs did have IRIW reordering. But ARM's own designs didn't. ARM was able to strengthen their guarantees without actually changing how anything worked in their own microarchitectures, beyond adding support for the new instructions of ARMv8's 32-bit mode. "Shared Memory Consistency Models: A Tutorial" from 1995, linked in comments, uses the term RCpc to describe a category of memory models that does include some readers being able to see some stores before other readers, allowing IRIW. So maybe being multi-copy atomic is orthogonal to RCpc, and RCpc doesn't imply anything about whether IRIW reordering is allowed or not? Regardless, ARMv8's memory model does forbid IRIW reordering.)
Big caveat: I'm not a terminology expert on this, and I've never heard of "processor consistent" before so I'm just guessing from context what they mean by it, with an interpretation that would be consistent with all known facts. Please correct me if this is incompatible with an accepted definition of the term.
Upvotes: 5
Reputation: 6052
I'm late to the party.
The conversation in the previous posts is very good. However, it is missing one key point that I think helps resolve the confusion: it covers only half the story. Let me fill in the second half, which is the part that shows the benefit of RCpc over RCsc on Arm.
Yes, Arm systems are multi-copy atomic, which means that it is not possible for a load to get "early access" of a previous write that hasn't completed yet. That is, the previous write to the same address will need to complete (become visible to all) before the load can complete (and return the same updated value to any core that loads the address)
See: https://developer.arm.com/documentation/ka002179/latest
"... Armv8 systems must be multi-copy atomic."
Does this make the LDAPR instruction useless/redundant? No. Here's why.
It is true that system-level multi-copy atomicity means LDAPR cannot return an early read of a store before that store is visible to all observers (i.e. all cores), which delays forward progress to some degree (a loss of potential performance). But there is still a potential performance benefit of a different kind. Forget about multi-thread/multi-core environments and just think about a single core: LDAPR provides a potential benefit in this case too.
Recall that within a single thread of execution, LDAPR can reorder before an earlier STLR. This is not true of LDAR. This is another place where there is potential for performance gain. See load-acquirePC section in the link below.
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
For example, in a single thread of execution, let's say we have CritSection1 -> CritSection2 -> CritSection3, where each critical section is delimited by an acquire-release pair and these pairs are to different addresses (i.e. the critical sections presumably act on independent, non-overlapping memory addresses). Using LDAPR then allows the execution of these 3 critical sections to overlap, giving a potential performance boost. If you use LDAR instead, you cannot overlap those independent critical sections; they have to execute sequentially in program order. Of course, you as the programmer need to make sure that the accesses in those 3 critical sections do not conflict, i.e. that they are in fact to independent variables (memory addresses). Otherwise, those critical sections will stomp all over each other, which might not be what you want.
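A rough sketch of what I mean, in C++. Mailbox, process and process_all are names I invented for this illustration, and the flag-based handoff is just one way to get an acquire ... release delimited section; it is not taken from DPDK or any real code. Each section begins with an acquire load and ends with a release store, and the three sections touch disjoint objects.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical handoff slot: a producer sets `ready`, the consumer works on
// the payload, then signals `done`.
struct Mailbox {
    std::atomic<int> ready{0};
    std::atomic<int> done{0};
    int payload = 0;
    int result = 0;
};

// One "critical section": acquire load at the start, release store at the end.
void process(Mailbox& m) {
    while (m.ready.load(std::memory_order_acquire) == 0) {
        // spin until the producer has published the payload
    }
    m.result = m.payload * 2;                    // work on this section's data
    m.done.store(1, std::memory_order_release);  // publish the result
}

// Three back-to-back sections on independent objects. If the acquire loads
// compile to LDAPR, the load that starts one section is allowed to complete
// before the release store (STLR) that ended the previous one; with LDAR it
// is not.
void process_all(Mailbox& a, Mailbox& b, Mailbox& c) {
    assert(&a != &b && &b != &c && &a != &c);  // sections must not alias
    process(a);
    process(b);
    process(c);
}
```

Whether the hardware actually overlaps anything here depends on the microarchitecture; the sketch only shows the shape of code where LDAPR gives it permission to.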
The blog below shows 70% benefit with a networking based workload. I'm not saying that this DPDK test described here is the same scenario I discussed above (I don't know that test or how DPDK works), but maybe this test is a similar thing to what I described. Where independent acquires are happening on sections of some queue. https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/enabling-rcpc-in-gcc-and-llvm
So yes, the full benefit of LDAPR is not possible on Arm because at the system level, it is multi-copy atomic, but there is still potential benefit to single threads of execution. This is how RCpc (ldapr-stlr) can get you more performance compared to RCsc (ldar-stlr).
Upvotes: 1