Reputation: 21510
Consider a cache line that is concurrently read and written (loads and stores only, no RMWs). What happens when the line is owned by a core in Modified or Exclusive state, and some other core issues store operations on this line (assuming these reads and writes are `std::atomic::load` and `std::atomic::store` with C++ compilers)? Does the line get pulled into the cache of the other core that is issuing the writes? Or do the writes find their way over to the reading core directly as needed? The difference between the two is that the latter only costs one round trip for reading the value of the line, and can possibly be parallelized as well (if the store and the read happen at staggered points in time).
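For concreteness, a minimal sketch of the access pattern I mean (the names and the core-pinning comments are illustrative only, not from any real code):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> shared_value{0};   // assume this sits in one cache line

void writer() {                     // imagine this pinned to core A
    for (int i = 1; i <= 1000; ++i)
        shared_value.store(i);      // seq_cst store
}

void reader() {                     // imagine this pinned to core B
    int seen = 0;
    while (seen < 1000)
        seen = shared_value.load(); // seq_cst load
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```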
This question arose when considering the consequences of NUMA in a concurrent application. But the question stands when the two cores involved are in the same NUMA node.
There are a large number of architectures in the mix, but for now I'm curious about what happens on Intel Skylake or Broadwell.
Upvotes: 0
Views: 187
Reputation: 363960
First of all, there's nothing special about atomic loads/stores vs. regular loads/stores by the time they're compiled to asm. (The default `seq_cst` memory order can compile to `xchg`, but `mov`+`mfence` is also a valid (often slower) option, which is indistinguishable in asm from a plain release store followed by a full barrier.) `xchg` is an atomic RMW + a full barrier. Compilers use it for the full-barrier effect; the load part of the exchange is just an unwanted side effect.
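For example (the comments describe typical x86-64 codegen; exact output depends on the compiler and version, so treat them as illustrative):

```cpp
#include <atomic>

std::atomic<int> x{0};

void seq_cst_store(int v) {
    // Typically an xchg (e.g. clang), or mov + mfence (e.g. some GCC
    // versions); both are valid ways to get a seq_cst store on x86.
    x.store(v, std::memory_order_seq_cst);
}

void release_store(int v) {
    // Just a plain mov store on x86; no barrier instruction needed.
    x.store(v, std::memory_order_release);
}
```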
The rest of this answer applies fully to any x86 asm store, or the store part of a memory-destination RMW instruction (whether it's atomic or not).
Initially the core that had previously been doing writes will have the line in MESI Modified state in its L1d, assuming it hasn't been evicted to L2 or L3 already.
The line changes MESI state (to Shared) in response to a read request; for stores, the core doing the write will send an RFO (Request For Ownership) and eventually get the line in Modified or Exclusive state.
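As a toy illustration of those transitions (not a hardware model; the enum and event names are mine), from the point of view of the core that currently holds the line:

```cpp
enum class State { Modified, Exclusive, Shared, Invalid };
enum class Event { RemoteRead, RemoteRFO, LocalStore };

State next(State s, Event e) {
    switch (e) {
    case Event::RemoteRead:     // share the line (write back first if dirty)
        return State::Shared;
    case Event::RemoteRFO:      // another core wants to write: we invalidate
        return State::Invalid;
    case Event::LocalStore:     // our own store needs ownership (RFO first
        return State::Modified; // if we were Shared/Invalid)
    }
    return s;
}
```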
Getting data between physical cores on modern Intel CPUs always involves write-back to shared L3 (not necessarily to DRAM). I think this is true even on multi-socket systems where the two cores are on separate sockets so they don't really share a common L3, using snooping (and snoop filtering).
Intel uses MESIF. AMD uses MOESI, which allows sharing dirty data directly between cores without write-back to/from a shared outer-level cache first.
For more details, see Which cache mapping technique is used in intel core i7 processor?
There's no way for store-data to reach another core except through cache/memory.
Your idea about the writes "happening" on another core is not how anything works. I don't see how it could even be implemented while respecting x86 memory ordering rules: stores from one core become globally visible in program order. I don't see how you could send stores (to different cache lines) to different cores and make sure one core waited for the other to commit those stores to the cache lines they each owned.
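As a concrete illustration of that ordering guarantee (a sketch with made-up variable names, using release/acquire so it's also valid portable C++, not just a statement about x86 asm):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> x{0}, y{0};

void producer() {                    // runs on one core
    x.store(1, std::memory_order_release);
    y.store(1, std::memory_order_release);
}

void consumer() {                    // runs on another core
    if (y.load(std::memory_order_acquire) == 1)
        // If the later store is visible, the earlier one must be too.
        assert(x.load(std::memory_order_acquire) == 1);
}
```

On x86 this ordering comes for free in the asm (stores commit in program order); the release/acquire orders are still needed to stop the compiler from reordering.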
It's also not really plausible even for a weakly-ordered ISA. Often when you read or write a cache line, you're going to do more reads+writes. Sending each read or write request separately over a mesh interconnect between cores would require many tiny messages. High throughput is much easier to achieve than low latency: wider buses can do that. Low latency for loads is essential for high performance. If threads ever migrate between cores, all of a sudden they'll be reading/writing cache lines that are all hot in L1d on some other core, which would be horrible until the CPU somehow decided that it should migrate the cache line to the core accessing it.
L1d caches are small, fast, and relatively simple. The logic for ordering a core's reads+writes relative to each other, and for doing speculative loads, is all internal to a single core. (Store buffer, or on Intel actually a Memory Order Buffer to track speculative loads as well as stores.)
This is why you should avoid even touching a shared variable if you can prove you don't have to. (Or use exponential backoff for cases where that's appropriate.) And why a CAS loop should spin read-only, waiting to see the value it's looking for, before even attempting a CAS, instead of hammering on the cache line with writes from failing `lock cmpxchg` attempts.
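A minimal sketch of that pattern, as a test-and-test-and-set style spinlock (my example, not from the question; `_mm_pause` is the x86 spin-wait hint):

```cpp
#include <atomic>
#include <immintrin.h>  // _mm_pause

std::atomic<bool> locked{false};

void lock() {
    for (;;) {
        // Spin read-only first: the line can stay Shared in this core's
        // cache while we wait, instead of bouncing around with RFOs.
        while (locked.load(std::memory_order_relaxed))
            _mm_pause();            // spin-wait hint

        // Only attempt the atomic RMW once the lock looks free.
        bool expected = false;
        if (locked.compare_exchange_weak(expected, true,
                                         std::memory_order_acquire))
            return;
    }
}

void unlock() {
    locked.store(false, std::memory_order_release);
}
```

The relaxed read-only spin keeps the line in Shared state in this core's cache until the holder's release store invalidates it, so only the actual CAS attempt generates an RFO.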
Upvotes: 3