Cache Line update consistency for atomically updating a whole cache line?

Question

I have the following scenario and looking for suggestions please:

Need to share data between two threads, A and B each running in different cores in the same processor, where thread A writes to an instance of data structure S and B thread reads it. I need the sharing of S to be as consistent and as fast as possible.

struct alignas(64) S 
{ 
  char cacheline [64]; 
};

Planning to leverage the consistency of a cache line, being visible to other cores as an atomic update. Therefore have thread A write to S as fast as possible (*1) so the update is consistent (atomic from a visibility perspective) and then demote (CLDEMOTE instruction) the cache line to the shared cache so that thread B can read it as fast as possible.

Note 1: The reason why it needs to happen fast is so that when core running thread A starts writing to the cache line, it can update all of its contents completely and then core making it visible in L1 (updates occur in the core store buffer), otherwise if it takes too long to update a "mid-state" of the cache-line may be pushed to L1 incurring into unnecessaries invalidation signals (MESI) penalties (as it needs to do the rest again), and worst inconsistent state in thread B.

Are there better ways to achieve this?

Thanks!

Peter Cordes · Accepted Answer

Yes, store then cldemote is a good plan. It runs as a nop on CPUs that don't support it, so you can use it optimistically. (Test that it actually helps your program on CPUs where it's not a nop, though, in case you accidentally demote before reading the line some more.)

Do you actually need atomicity, or is that just nice to have some of the time? If you need atomicity, you can't use separate store instructions. Coalescing in the store buffer isn't guaranteed; for L1d hits it may only sometimes happen on Ice Lake. And an interrupt can happen at any point (unless interrupts are disabled, but SMI and NMI can't be disabled). Including between two stores you were hoping would commit together.

32-byte AVX aligned loads and stores aren't guaranteed atomic, but in practice they probably are on Haswell and later (where the load/store units are 32 bytes wide).

Similarly, 64-byte AVX-512 loads and stores aren't guaranteed atomic, and very likely won't be in practice on Zen4 where they're done in two 32-byte halves. But they probably are on Intel CPUs with AVX-512, if you want to do some testing and find some "works in practice" functionality that doesn't show any tearing on the actual machine you care about.

16-byte loads/stores are guaranteed atomic on Intel CPUs that have the AVX feature flag. (Finally documented after being true for years, fortunately retroactive with an existing feature bit.) AMD doesn't document this yet, but it's probably true of AMD CPUs with AVX, too.

Related: https://rigtorp.se/isatomic/ / SSE instructions: which CPUs can do atomic 16B memory operations?

movdir64b will provide guaranteed 64-byte write atomicity, but only with NT semantics: evicting the cache line all the way to DRAM. It also doesn't provide 64-byte atomic read, so the read side would need to check sequence numbers or something, like a SeqLock.

Intel TSX (transactional memory) can let you commit changes to a whole cache line (or more) as a single atomic transaction. But Intel keeps disabling it with microcode updates. The HLE part (optimistic lock add handling) is fully gone, but the RTM part (xbegin / xend) can still be enabled on some CPUs, I think.

For a use case like this where one thread is only writing, you might consider a SeqLock, using 4 bytes of the cache line as a sequence number. Optimal way to pass a few variables between 2 threads pinning different CPUs / how to implement a seqlock lock using c++11 atomic library

The writer can load the sequence number, store seq+1 (with a plain mov store, no lock inc needed), store the payload with regular stores, or SIMD if convenient, then store seq+2.

Unfortunately without guarantees of vector load/store atomicity, or of ordering between parts of it, you can't have the reader just load the whole cache line at once, you do need 3 separate loads. (Seq number, whole line, then seq number again.)

But if you want to use 32-byte atomicity which appears to be true in practice on Haswell and Zen2 and later, maybe put a sequence number in each 32-byte half of a cache line, so the reader can check with vpcmpeqd / vpmovmskps / test al,1 to check that the first dword element (sequence number) matched between halves. Or maybe put them somewhere else within the vector to make reassembling the payload cheaper.

This spends space for two sequence numbers to save loads in the reader, but might cost more overhead in shuffling data into / out of vectors. I guess maybe store with vmovdqua [rdi+28], ymm1 / vmovdqu [rdi], ymm0 could leave you with 60 useful bytes starting at rdi+4, overwriting the 4 byte sequence number at the start of ymm1. Store-forwarding to a 32-byte load from [rdi+4] would stall, but narrower loads that don't span the boundary between the two earlier stores would be fine even.

Related Q&As about solving the same problem of pushing data for other cores to be able to read cheaply:

CPU cache inhibition
x86 MESI invalidate cache line latency issue
Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? - Sapphire Rapids has UIPI for user-space interrupt handling of special inter-processor interrupts. So that's fun if you want low latency notification. If you just want to read whatever the current state of a shared data structure is, SeqLock or RCU are good.

Cache Line update consistency for atomically updating a whole cache line?

Answers (1)

Related Questions