CarloC

Reputation: 181

How does the x86 TSO memory consistency model work when some of the stores being observed come from store-forwarding?

I have a question about the x86 TSO memory consistency model. Starting from 'A Primer on Memory Consistency and Cache Coherence', it seems to me that the 'global store order' (i.e. the global order of stores in memory order) could look different from the point of view of the cores involved.

Suppose you have 2 cores each with a FIFO write buffer as in the following model:

[figure: two Cores, each with a FIFO write buffer, connected to a shared memory through a switch]

Each Core issues/inserts stores into its write buffer in program order, and I believe the global store order is basically defined by the order in which the 'memory switch' selects Cores to 'drain' their write buffers.

Now, adding store-to-load forwarding, so that loads can bypass the most recent stores in program order on each Core, the order of stores that the program running on one Core observes (through its loads) can be different from the order observed by the program running on the other Core (even though each Core's own stores are still observed in its program order).
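
For a concrete example (my own sketch, not taken from the book), here is the classic store-buffer litmus test written with C11 threads and relaxed atomics. Each Core's store can still be sitting in its write buffer when the other Core's load goes to memory, so both loads may return 0, an outcome that no interleaving of the two program orders can produce:

 /* Store-buffer litmus test, a hypothetical C11 sketch (names are mine).
  * Relaxed atomics stand in for plain x86 loads/stores; under TSO the
  * outcome r0 == 0 && r1 == 0 is allowed because each store can wait in
  * its core's FIFO write buffer while the other core's load reads memory. */
 #include <stdatomic.h>
 #include <stdio.h>
 #include <threads.h>

 static _Atomic int x, y;
 static int r0, r1;

 static int core0(void *arg)
 {
     (void)arg;
     atomic_store_explicit(&x, 1, memory_order_relaxed);  /* store enters the write buffer    */
     r0 = atomic_load_explicit(&y, memory_order_relaxed); /* load may pass the buffered store */
     return 0;
 }

 static int core1(void *arg)
 {
     (void)arg;
     atomic_store_explicit(&y, 1, memory_order_relaxed);
     r1 = atomic_load_explicit(&x, memory_order_relaxed);
     return 0;
 }

 int main(void)
 {
     for (int i = 0; i < 100000; i++) {
         atomic_store_explicit(&x, 0, memory_order_relaxed);
         atomic_store_explicit(&y, 0, memory_order_relaxed);
         thrd_t t0, t1;
         thrd_create(&t0, core0, NULL);
         thrd_create(&t1, core1, NULL);
         thrd_join(t0, NULL);
         thrd_join(t1, NULL);
         if (r0 == 0 && r1 == 0)
             printf("iteration %d: r0 == 0 && r1 == 0\n", i);
     }
     return 0;
 }

If each core also read back its own flag before reading the other one, that first load would be satisfied by store-to-load forwarding from the write buffer, and both cores could conclude that their own store came first, which is exactly the kind of disagreement I'm asking about.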

Maybe I'm missing some point...

Upvotes: 2

Views: 3160

Answers (3)

Brendan

Reputation: 37232

How does the x86 TSO memory consistency model work when some of the stores being observed come from store-forwarding?

80x86 itself is not plain TSO, it's "TSO with store forwarding", so reasoning that assumes plain TSO does not necessarily hold.

To enforce a TSO consistency model you may have to use explicit barriers (e.g. mfence) to prevent some store forwarding.

For a specific example, consider one CPU doing:

 C = -1;

 A = 0;
 B = 0;
 if(B == 1) C = A;

.. while another CPU does:

 A = 1;
 B = 1;

For TSO, it would be impossible for the result to be C == 0 - either the other CPU didn't do B = 1 fast enough and therefore C = -1, or the other CPU must have already done A = 1 and therefore C = 1.

For "TSO with store forwarding" the first CPU can see that the other CPU did B = 1 and then forward its old value of A from the store at A = 0 to the C = A; and therefore C == 0 is possible (when it should not be possible for TSO).

As a general rule: whenever a CPU's reads might not occur in the same order as its writes, you need a barrier to enforce TSO.

For my example, the first CPU writes A then B, but then reads B then A (which is the opposite order), so it needs a barrier to enforce TSO. It'd be like:

 C = -1;

 A = 0;
 B = 0;
 MFENCE;
 if(B == 1) C = A;
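
As a rough, compilable sketch (the variable names and the C11 mapping are illustrative assumptions, not a definitive recipe), this could be written like the following; on x86, mainstream compilers typically emit an mfence (or an equivalent locked instruction) for atomic_thread_fence(memory_order_seq_cst):

 /* Hedged C11 sketch of the example above; thread and variable names are illustrative.
  * The seq_cst fence stands in for MFENCE: it keeps the later loads from being
  * satisfied while the stores to A and B are still sitting in the store buffer. */
 #include <stdatomic.h>
 #include <stdio.h>
 #include <threads.h>

 static _Atomic int A, B, C;

 static int cpu0(void *arg)
 {
     (void)arg;
     atomic_store_explicit(&C, -1, memory_order_relaxed);
     atomic_store_explicit(&A, 0, memory_order_relaxed);
     atomic_store_explicit(&B, 0, memory_order_relaxed);
     atomic_thread_fence(memory_order_seq_cst);          /* MFENCE: drain the store buffer first */
     if (atomic_load_explicit(&B, memory_order_relaxed) == 1)
         atomic_store_explicit(&C,
                               atomic_load_explicit(&A, memory_order_relaxed),
                               memory_order_relaxed);
     return 0;
 }

 static int cpu1(void *arg)
 {
     (void)arg;
     atomic_store_explicit(&A, 1, memory_order_relaxed);
     atomic_store_explicit(&B, 1, memory_order_relaxed);
     return 0;
 }

 int main(void)
 {
     thrd_t t0, t1;
     thrd_create(&t0, cpu0, NULL);
     thrd_create(&t1, cpu1, NULL);
     thrd_join(t0, NULL);
     thrd_join(t1, NULL);
     printf("C = %d\n", atomic_load(&C));
     return 0;
 }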

Of course, for most programs the differences between TSO and "TSO with store forwarding" matter extremely rarely. Often the programming language (e.g. C) doesn't assume TSO in the first place, and often you have some kind of lock or mutex that enforces a specific order regardless of what the CPU does.

Upvotes: 1

pveentjer

Reputation: 11317

Modern CPUs have coherent caches. A coherent cache is implemented by a cache coherence protocol. When a store wants to commit to the coherent cache, it first needs to wait for the cache line to be invalidated on all other CPUs. Any other CPU that accesses the cache after the cache line has been invalidated will not see an older version.

Writing to a cache line is linearizable because the effect of writing to a cache line takes place between the start of the write and completion of the write (its linearization point is between these 2 events). Reading from the cache is linearizable as well since its linearization point is also between start and completion.

Because accessing a cache line is linearizable and linearizability is composable, the cache (the set of all cache lines) is linearizable. Because the cache is linearizable, there is guaranteed to be a total order over all reads and writes from the cache.

This total order might or might not be consistent with the program order; that is not a concern of the cache. E.g. out-of-order execution, or committing stores from the store buffer into the cache out of order, might break program order. But at least nobody will be able to disagree on the order in which loads/stores are done on the cache.

The problems that prevent having a total order on the memory order are caused by the layers above the cache. For example, the store buffer of one CPU might be shared with a strict subset of the CPUs (e.g. PowerPC). This way some CPUs can see stores early and others can't. Another cause is STLF (Store To Load Forwarding, aka store forwarding).


How does TSO/STLF fit into the picture:

Because older stores can be reordered with newer loads to different addresses, due to store buffers, we need to give up the [StoreLoad] ordering within memory order. But this doesn't mean we can't have a total order over all loads/stores; it just means we need to relax the program order restrictions within the memory order.

The tricky part is STLF, because with STLF there can be disagreement about how a load is ordered with respect to the store it reads from. Example:

int a=0, b=0;

CPU1:
   a=1    (1)
   r1=a   (2)
   r2=b   (3)

CPU2:
   b=1    (4)
   r3=b   (5)
   r4=a   (6)

Condition:
    r1=1, r2=0, r3=1, r4=0

The condition is possible with TSO. The load 'r1=a' can only be ordered after the store 'a=1' (a load can't be ordered in the memory order before the store it reads from).

But 'r1=a' is also ordered before 'a=1' due to STLF: it takes its value from the store buffer, before 'a=1' is inserted into the memory order.

So we end up with a cycle in the memory order. To resolve this problem, the load 'r1=a' and the store 'a=1' are not ordered by the memory order. As a consequence, we are just left with a total order over the stores.
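
For anyone who wants to run the test, here is one possible C11 rendering of the litmus test above (the thread harness and the use of relaxed atomics are illustrative choices, not taken from the primer). The flagged outcome is the one discussed: loads (2) and (5) are satisfied by store-to-load forwarding before the stores (1) and (4) reach the coherent cache:

 /* One possible C11 rendering of the litmus test above (harness is illustrative).
  * r1 == 1, r2 == 0, r3 == 1, r4 == 0 is permitted under TSO because the loads
  * (2) and (5) can be satisfied from the store buffers while the stores (1)
  * and (4) are still waiting to commit to the cache. */
 #include <stdatomic.h>
 #include <stdio.h>
 #include <threads.h>

 static _Atomic int a, b;
 static int r1, r2, r3, r4;

 static int cpu1(void *arg)
 {
     (void)arg;
     atomic_store_explicit(&a, 1, memory_order_relaxed);   /* (1) */
     r1 = atomic_load_explicit(&a, memory_order_relaxed);  /* (2) may be forwarded */
     r2 = atomic_load_explicit(&b, memory_order_relaxed);  /* (3) */
     return 0;
 }

 static int cpu2(void *arg)
 {
     (void)arg;
     atomic_store_explicit(&b, 1, memory_order_relaxed);   /* (4) */
     r3 = atomic_load_explicit(&b, memory_order_relaxed);  /* (5) may be forwarded */
     r4 = atomic_load_explicit(&a, memory_order_relaxed);  /* (6) */
     return 0;
 }

 int main(void)
 {
     for (int i = 0; i < 100000; i++) {
         atomic_store_explicit(&a, 0, memory_order_relaxed);
         atomic_store_explicit(&b, 0, memory_order_relaxed);
         thrd_t t1, t2;
         thrd_create(&t1, cpu1, NULL);
         thrd_create(&t2, cpu2, NULL);
         thrd_join(t1, NULL);
         thrd_join(t2, NULL);
         if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0)
             printf("iteration %d: condition observed\n", i);
     }
     return 0;
 }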

The book "A primer on Memory Consistency and Cache Coherence" is a superb book. A lot of my understanding on this topic comes from this book and the linearizability argument and dealing with the STLF cycle from this book (I also had a few message exchanges with one of the authors).

Upvotes: 4

CarloC

Reputation: 181

Starting from the comments I received, I will try to answer my own question.

Let's assume Cores execute load & store memory operations in program order in the model above. Each Core inserts its own stores into its FIFO write buffer (store queue).

When a load executes there are 2 cases:

  1. there is at least one store to the same address as the load waiting inside the write buffer (store queue): store-to-load forwarding takes place, the load is satisfied from the store queue, and it does not hit the memory system

  2. there is no store to the same address as the load waiting inside the store queue: the load has to be satisfied from the memory system, so the Core actually stalls waiting for the 'memory switch' to select it to serve the load and drain the pending stores inside the store queue

This way TSO memory consistency is satisfied, and its global order is defined by the sequence in which stores from all cores become globally visible (i.e. drained from the cores' store queues into memory in the order chosen by the 'memory switch').
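
As a small illustration of that load path (the data structures and helper names below are hypothetical, just to make the two cases concrete, not taken from any real design), a single Core's store queue could be sketched like this:

 /* Hypothetical sketch of one core's FIFO store queue and its load path. */
 #include <stdbool.h>
 #include <stdint.h>
 #include <stdio.h>

 #define SQ_SIZE   16
 #define MEM_WORDS 1024

 static uint64_t memory[MEM_WORDS];   /* toy word-addressed "memory system" */

 struct sq_entry { uint64_t addr; uint64_t data; bool valid; };

 struct store_queue {
     struct sq_entry entries[SQ_SIZE];
     int head, tail;                  /* FIFO: drain from head, insert at tail */
 };

 /* Stores are inserted in program order (overflow handling omitted). */
 static void store(struct store_queue *sq, uint64_t addr, uint64_t data)
 {
     sq->entries[sq->tail] = (struct sq_entry){ .addr = addr, .data = data, .valid = true };
     sq->tail = (sq->tail + 1) % SQ_SIZE;
 }

 /* Case 2 helper: when the 'memory switch' selects this core, its pending
  * stores drain into memory in FIFO order and become globally visible. */
 static void drain(struct store_queue *sq)
 {
     while (sq->head != sq->tail) {
         struct sq_entry *e = &sq->entries[sq->head];
         memory[e->addr % MEM_WORDS] = e->data;
         e->valid = false;
         sq->head = (sq->head + 1) % SQ_SIZE;
     }
 }

 /* Case 1: youngest matching store in the queue -> forward its value.
  * Case 2: no match -> stall, drain the pending stores, then read memory. */
 static uint64_t load(struct store_queue *sq, uint64_t addr)
 {
     for (int i = (sq->tail + SQ_SIZE - 1) % SQ_SIZE;
          i != (sq->head + SQ_SIZE - 1) % SQ_SIZE;
          i = (i + SQ_SIZE - 1) % SQ_SIZE)
         if (sq->entries[i].valid && sq->entries[i].addr == addr)
             return sq->entries[i].data;          /* store-to-load forwarding */
     drain(sq);
     return memory[addr % MEM_WORDS];
 }

 int main(void)
 {
     struct store_queue sq = { 0 };
     store(&sq, 8, 42);
     printf("load(8)  = %llu (forwarded from the store queue)\n",
            (unsigned long long)load(&sq, 8));
     printf("load(16) = %llu (from memory, after draining the queue)\n",
            (unsigned long long)load(&sq, 16));
     return 0;
 }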

Upvotes: 1
