Reputation: 9
Why exactly is a full read memory barrier required in this example from the kernel docs (Documentation/memory-barriers.txt:709):
q = READ_ONCE(a);
if (q) {
    <read barrier> // why?
    p = READ_ONCE(b);
}
The explanation says 'the CPU may short-circuit by attempting to predict the outcome in advance, so that other CPUs see the load from b as having happened before the load from a'.
Does this explanation imply that the CPU executing this snippet will not reorder the reads from a and b?
Why is the order in which other CPUs see the reads important? What would be an example of a scenario where this results in a bug?
The only issue I can see is if the CPU emits the reads to a and b out of order.
Are the CPUs supported by the kernel allowed to do that reordering?
If yes, where is this rule stated? That would explain the need for the barrier, but not for the reason given in the explanation.
I tried to ask on IRC, but no one knew.
Upvotes: 1
Views: 258
Reputation: 365517
Speculative execution is the key here. A CPU can speculate a load, unlike a store, because a load doesn't have any side-effects observable by other cores.
CPUs handle control dependencies (branches) with branch prediction + speculative execution, instead of treating them like data dependencies. (Difference between data dependence and control dependence).
The second load can start as soon as the address of b is available and the load instruction enters the out-of-order back-end, before q is ready. When q is ready and the cbnz or whatever can execute to confirm the branch prediction was correct, nothing happens; the later load was already started. (If it instead detects that q was zero, so the load shouldn't have happened, execution rolls back to the correct path, discarding the load result.)
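Concretely, here is one possible execution of the snippet from the question on an out-of-order core (an annotated sketch; the timing described in the comments is just one legal scenario):

q = READ_ONCE(a);     // load #1 issues; suppose it misses in cache
if (q) {              // branch predicted taken before q arrives
    p = READ_ONCE(b); // load #2 issues speculatively and can hit in cache,
                      // producing a value before load #1 completes
}
// When q finally arrives, the prediction is confirmed, or execution rolls
// back and the speculatively-loaded value of p is discarded.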
Yes, most non-x86 CPUs do that. (And modern x86 internally speculates but takes a memory ordering machine clear if a cache line wasn't still valid when it's architecturally allowed to be read. Loading early is so important for performance that it's worth speculating on, especially since speculation will be valid as long as no other core is invalidating the cache line, e.g. true or false sharing.)
Related: Why do weak memory models exist and how is their instruction order selected?
Note that these loads are independent: the address for the second doesn't depend on the load result from the first. If they were dependent, all architectures except some models of DEC Alpha would do the dependent load after the first, like C++ memory_order_consume. (Dependent loads reordering in CPU). In the Linux kernel memory model, you'd need smp_read_barrier_depends(), which is a no-op on the ISAs still supported by Linux, and which is apparently now considered obsolete in the kernel, implicit in the READ_ONCE macro.
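For contrast, a sketch of the two cases (the struct and variable names in the second half are made up for illustration):

// Independent loads, control dependency only: the address of b is known
// up front, so the CPU can speculate the second load past the branch.
q = READ_ONCE(a);
if (q)
    p = READ_ONCE(b);

// Dependent loads, address dependency: the second load's address comes from
// the first load's result, so every ISA except some DEC Alphas orders them
// without a barrier.
struct item *ptr = READ_ONCE(head);
if (ptr)
    val = READ_ONCE(ptr->payload);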
Why is the order in which other CPUs see the reads important? What would be an example of a scenario where this results in a bug?
Writer:

write elements of buffer[0..9];
smp_wmb();
WRITE_ONCE(data_ready, 1);   // C++11: data_ready.store(1, memory_order_release)

Reader:

if (READ_ONCE(data_ready)) {
    smp_rmb();   // promotes the read of data_ready to an acquire load
    read buffer[0..9];
}
Without the smp_rmb read memory barrier (or better, an acquire load like AArch64 ldapr, which is what C++11 std::atomic's data_ready.load(std::memory_order_acquire) would use), the reader can load stale values of buffer[i] from before the writer's stores.
This is the canonical use-case for acquire/release semantics (https://preshing.com/20120913/acquire-and-release-semantics/), often with a spin-loop in the reader instead of a single if().
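For completeness, here's a minimal user-space C11 analogue of the same pattern (a sketch; the buffer contents, thread setup, and names other than data_ready are invented for illustration):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int buffer[10];
static atomic_int data_ready = 0;

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++)
        buffer[i] = i * i;              // plain stores
    // Release store: orders the buffer stores before the flag becomes
    // visible, like smp_wmb() + WRITE_ONCE() in the kernel version.
    atomic_store_explicit(&data_ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    // Acquire load in a spin-loop: once we see the flag set, we're
    // guaranteed to also see the writer's buffer stores.
    while (!atomic_load_explicit(&data_ready, memory_order_acquire))
        ;                               // spin until published
    for (int i = 0; i < 10; i++)
        printf("buffer[%d] = %d\n", i, buffer[i]);
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}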
Upvotes: 3