intprx

Reputation: 91

Can you have torn reads/writes between two threads pinned to different processors, if the system is cache coherent?

If you have two threads on the same processor, you can have a torn read/write.

For example, on a 32-bit system with thread 1 and thread 2 running on the same core:

  1. Thread 1 assigns a 64-bit int 0xffffffffffffffff to a global variable X, which is initially zero.
  2. The first 32 bits of the store commit to X; now X is 0xffffffff00000000.
  3. Thread 2 reads X as 0xffffffff00000000
  4. Thread 1 writes the last 32 bits.

The torn read happens in step 3.
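
Here is a minimal C++ sketch of that scenario (the names are mine, not from the question). With a plain non-atomic 64-bit global, a 32-bit target is allowed to commit the store as two 32-bit halves, so this program has a data race and a reader may observe a torn value:

  #include <cstdint>
  #include <cstdio>
  #include <thread>

  uint64_t X = 0;   // plain (non-atomic) global: no atomicity guarantee

  void writer() {
      X = 0xffffffffffffffffULL;   // may commit as two separate 32-bit stores
  }

  void reader() {
      uint64_t v = X;              // may be read as two separate 32-bit loads
      if (v != 0 && v != 0xffffffffffffffffULL)
          std::printf("torn read: %016llx\n", (unsigned long long)v);
  }

  int main() {
      std::thread t1(writer), t2(reader);   // core pinning omitted for brevity
      t1.join();
      t2.join();
  }

(The data race makes this undefined behaviour in ISO C++; it's only meant to illustrate what the hardware can do with the two 32-bit halves.)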

But what if the following conditions are met:

  1. Thread 1 and Thread 2 are pinned to different cores
  2. The system uses MESI protocol to achieve cache coherence

In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?

Upvotes: 2

Views: 1062

Answers (1)

Peter Cordes

Reputation: 364160

Yes, you can have tearing.

A share-request for the line could come in between the commit of the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even take an interrupt between the first and 2nd store. That would defeat any store coalescing in the store buffer (into aligned 64-bit commits, as some 32-bit RISC CPUs are documented to do) that might normally make tearing between separate 32-bit stores hard to observe in practice.

Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half (because it received an RFO (Read For Ownership) from the writer core). The first read could see the old value, the 2nd read could see the new value.

The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.

(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)

Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.
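
In C++ the portable way to ask for such a single atomic access is std::atomic (a sketch of mine, not part of the original answer); on a 32-bit target it's the implementation's choice whether a 64-bit atomic is lock-free, which is what is_lock_free() reports:

  #include <atomic>
  #include <cstdint>
  #include <cstdio>

  std::atomic<uint64_t> X{0};   // atomic object instead of a plain uint64_t

  void writer() {
      // A single atomic store: no thread can ever observe only half of it.
      X.store(0xffffffffffffffffULL, std::memory_order_relaxed);
  }

  void reader() {
      uint64_t v = X.load(std::memory_order_relaxed);   // single atomic load
      // v is either 0 or all-ones, never a torn mix of the two.
      std::printf("lock-free: %d  value: %016llx\n",
                  (int)X.is_lock_free(), (unsigned long long)v);
  }

On 32-bit x86 with SSE2 this can stay lock-free (an 8-byte SIMD or x87 load/store); on a target without any 8-byte atomic access the library may fall back to a lock.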


If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?

By delaying its response to MESI requests to share or invalidate a line while it's in the middle of a pair of loads or a pair of stores done with a special instruction that gives atomicity where a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can delay responding; otherwise starvation / low overall throughput becomes a problem.

A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.

But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)
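
As a concrete (compiler-dependent, illustrative) example: on 32-bit ARMv7-A, a lock-free 64-bit atomic store is commonly compiled to an LDREXD/STREXD retry loop, since that's the instruction pair with 2-register atomicity there:

  #include <atomic>
  #include <cstdint>

  std::atomic<uint64_t> shared{0};

  void publish(uint64_t v) {
      // Relaxed is enough to prevent tearing; ordering is a separate concern.
      shared.store(v, std::memory_order_relaxed);
      // Typical ARMv7 expansion (illustrative, not exact compiler output):
      //   retry:
      //     ldrexd  r2, r3, [r0]       ; load-exclusive the old 64-bit value
      //     strexd  r1, r4, r5, [r0]   ; try to store the new pair exclusively
      //     cmp     r1, #0
      //     bne     retry              ; lost exclusivity -> retry
  }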


Upvotes: 3
