Jichao

Reputation: 41845

How can I demonstrate LoadStore reordering in practice with relaxed loads/stores, with a load observing a value that was round-tripped through another thread?

#include <atomic>
#include <thread>

void test_relaxed()
{
    using namespace std;
    atomic<int> x{0};
    atomic<int> y{0};

    std::thread t1([&] {
        auto r1 = y.load(memory_order_relaxed); //a
        x.store(r1, memory_order_relaxed); //b
    });

    std::thread t2([&] {
        auto r2 = x.load(memory_order_relaxed); //c
        y.store(42, memory_order_relaxed); //d
    });
    
    t1.join();
    t2.join();
}

According to cppreference (in a relaxed ordering example), the above code is allowed to produce r1 == r2 == 42.

But I have tested it on x86-64 and arm64 platforms and I cannot get this result. Is there any way to get it in practice with real compilers and CPUs?

(Godbolt)

Upvotes: 2

Views: 345

Answers (2)

пётр

Reputation: 56

According to the ARM Memory Tool (the article, the online tool), arm64 allows this behavior (which means it might occur on some arm64 CPUs).

The following is a litmus test for your example:

AArch64 SO-q-2023-03-06
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
 P0          | P1          ;
 LDR W0,[X1] | LDR W0,[X1] ;
 STR W0,[X3] | MOV W2,#42  ;
             | STR W2,[X3] ;
exists
(0:X0=42 /\ 1:X0=42)

You can try it yourself in the online tool.

But you may have trouble finding arm64 hardware that actually exhibits this behavior.

I have no data for arm64, but there is such data for ARMv7 (it might give you insight, or you might want to try to reproduce your example on ARMv7).
Your test case is very similar to the LB+data+po litmus test.
The results for that litmus test on different hardware are here: as you can see, it reproduces only on some hardware.
The meaning of the hardware abbreviations used in the table is given here.

Upvotes: 4

Peter Cordes

Reputation: 365557

On x86-64, this could only happen if the compiler chose to reorder at compile time. The x86 memory model is program order plus a store buffer with store forwarding, which only allows StoreLoad reordering, plus the quirks of threads seeing their own stores before they're globally visible. (Neither thread is doing that here.) https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/ has some practical reordering tests on ARM, but not this one.

I think it's theoretically possible on ARM / AArch64, but only with an extremely slow response to a MESI share request for x. Perhaps multiple other cores spamming atomic RMWs on it, like x.fetch_add(0) in other threads? (Beware of clang optimizing no-op RMWs to just memory barriers without touching the cache line). Or incomplete LDXR / STXR transactions that don't actually modify it? Maybe with hand-written asm with ldxr / clrex if that's a thing in 64-bit mode.

Or more simply, a 3rd and 4th thread spamming stores (and maybe loads) on a different part of the same cache line holding x.
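
A rough sketch of contention helpers along those lines (hypothetical; whether they actually delay the x.load() reply long enough in practice is untested). LineWithX and hammer_line are names I made up; the idea is just to keep the cache line holding x as busy as possible from a few extra threads:

#include <atomic>

// Wrapper that puts x and a neighbouring word on the same 64-byte cache line
// (64-byte line size assumed).
struct alignas(64) LineWithX {
    std::atomic<int> x{0};
    std::atomic<int> neighbour{0};   // different word, same line as x
};

// Run this in a couple of extra threads to keep the line bouncing between cores.
void hammer_line(LineWithX& line, const std::atomic<bool>& stop)
{
    while (!stop.load(std::memory_order_relaxed)) {
        // No-op RMW: still needs exclusive ownership of the line
        // (but per the caveat above, clang may turn fetch_add(0) into just a
        // barrier without touching the line, so check the generated asm).
        line.x.fetch_add(0, std::memory_order_relaxed);
        // Plain stores/loads to the neighbouring word: false sharing with x.
        line.neighbour.store(1, std::memory_order_relaxed);
        (void)line.neighbour.load(std::memory_order_relaxed);
    }
}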


Separate cache lines for the variables are necessary (so use alignas(64) std::atomic<int> x {0}; etc.). We need thread 2's x.load (C) to miss in cache so badly that it doesn't take a value until after a round trip to the other thread. And probably thread 2 shouldn't even send out (to other cores) a request to read the cache line until long after its store has retired and committed to L1d cache. Otherwise the request will go out to other cores before thread 1 has even dirtied that cache line, so data will just come in from DRAM, or from the main thread's store since it's a local on the stack.
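
For example (a sketch; 64 bytes is assumed as the cache-line size):

#include <atomic>

// Each atomic on its own cache line, so a miss on x is independent of y.
alignas(64) std::atomic<int> x{0};
alignas(64) std::atomic<int> y{0};

Where available, std::hardware_destructive_interference_size from <new> can replace the hard-coded 64.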

A long dependency chain for the load address like Nate tried won't help, unfortunately. That would work if you could delay the load that way without delaying visibility of the store that follows it, but that's not the case due to the way out-of-order exec works. (In-order exec of course automatically delays all later instructions when earlier instructions are slow.)

Stores can't become visible to other cores (commit from the store buffer to L1d cache, see footnote 1) until after they retire from out-of-order exec, because they're still speculative until then. And speculation needs to be kept private within one core so only it has to roll back on detection of mis-speculation. See Can a speculatively executed CPU branch contain opcodes that access RAM? (All instructions that could potentially fault are speculated, not just branches.)

So store D can't retire until after the load address for C is ready and the load itself has been sent to an execution unit, unless the compiler statically (in the asm) reorders the store before the load. It's allowed to do that, but it won't do so for no reason. And compilers don't (currently) optimize atomics, so an earlier store of y to get the compiler to do dead-store elimination won't work, nor will other operations on x.
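
As a quick illustration of the "compilers don't optimize atomics" point (easy to verify on Godbolt): even an obviously dead relaxed store is still emitted, in program order after the load. The function below is just my own stand-in for thread 2's body:

#include <atomic>

// Hypothetical attempt: the first store to y is dead (immediately overwritten),
// but current GCC/clang still emit both stores, after the load of x.
int thread2_body(std::atomic<int>& x, std::atomic<int>& y)
{
    int r2 = x.load(std::memory_order_relaxed);   // c
    y.store(0,  std::memory_order_relaxed);       // dead store, not eliminated
    y.store(42, std::memory_order_relaxed);       // d
    return r2;
}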


If the core running thread 2 sends out a MESI share request due to x.load() at about the same time it sends out an RFO (Read For Ownership) due to y.store(), it's going to get replies to both within a pretty narrow window. That's far too little time for another core to read the store value and have modified the cache line holding x before receiving and responding to the share request.

At least under normal conditions; with heavy contention for the cache line holding x, the reply could be delayed a lot, with some other threads getting a turn (giving time for y.store to be seen by y.load in thread 1).

But the x.store in thread 1 (B) won't happen until one hop of inter-thread latency after the y.store in thread 2 (D). And the x.load in thread 2 (C) will have had all that time to get its turn to read that cache line.

Perhaps it could help to have a lot of previous cache misses (to scattered addresses) outstanding in thread 2, in hopes that the load could still be sent to an execution unit and retire before there was a free buffer to track an off-core MESI request for it. If those loads were all ldapr acquire loads, that would make sure they had to finish before the relaxed ldr.

A non-x86 ISA can retire a load from the out-of-order back-end before the data arrives, as long as the address is known and it's verified that it's non-faulting. (i.e. TLB hit and valid translation.) This is how LoadStore reordering is possible at all.


Other notes on testing methodology:

You need to compile with optimization enabled, otherwise GCC won't inline x.store and so on, and a run-time-variable memory_order gets treated as seq_cst instead of branching to maybe skip a fence. https://godbolt.org/z/6rz43a716

Thread startup may take longer than the reordering window; as Nate described in comments, you want to run multiple tests after starting a thread.

Perhaps have thread 1 spin until it sees a non-0 value, then store that. That removes the need for the timing coincidence of thread 1 happening to run at just the right moment. Preshing's https://preshing.com/20120515/memory-reordering-caught-in-the-act x86 StoreLoad test has a sync and a random delay before doing one test, to maximize the chance that both threads are really running at the same time, not having one finish before the OS can even get around to starting the 2nd one on a different core. A harness along those lines is sketched below.
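
Putting those notes together, a minimal harness might look something like this (a sketch, not a guaranteed reproducer; the go/done trial synchronization, the trial count, and the 64-byte alignment are my own choices):

#include <atomic>
#include <cstdio>
#include <thread>

// The two test variables, each on its own cache line (64-byte line assumed).
alignas(64) std::atomic<int> x{0};
alignas(64) std::atomic<int> y{0};

// Crude per-trial synchronization so each thread start-up runs many trials.
alignas(64) std::atomic<int> go{0};
alignas(64) std::atomic<int> done{0};

int main()
{
    constexpr int kTrials = 100000;

    std::thread t1([] {
        for (int i = 1; i <= kTrials; ++i) {
            while (go.load(std::memory_order_acquire) < i) {}         // wait for trial i
            int r1;
            while ((r1 = y.load(std::memory_order_relaxed)) == 0) {}  // a: spin until non-0
            x.store(r1, std::memory_order_relaxed);                   // b
            done.fetch_add(1, std::memory_order_release);
        }
    });

    std::thread t2([] {
        for (int i = 1; i <= kTrials; ++i) {
            while (go.load(std::memory_order_acquire) < i) {}         // wait for trial i
            int r2 = x.load(std::memory_order_relaxed);               // c
            y.store(42, std::memory_order_relaxed);                   // d
            if (r2 == 42)
                std::printf("r1 == r2 == 42 on trial %d\n", i);       // reordering observed
            done.fetch_add(1, std::memory_order_release);
        }
    });

    for (int i = 1; i <= kTrials; ++i) {
        go.store(i, std::memory_order_release);                   // start trial i
        while (done.load(std::memory_order_acquire) < 2 * i) {}   // wait for both threads
        x.store(0, std::memory_order_relaxed);                    // reset for the next trial
        y.store(0, std::memory_order_relaxed);
    }
    t1.join();
    t2.join();
}

The reset of x and y is sequenced before the next release store to go, and each thread's acquire load of go synchronizes with that store, so a stale 42 left over from the previous trial can't be re-read; seeing r2 == 42 within a trial can only come from the round trip the question is asking about.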


Footnote 1: it's also possible for a core to make retired ("graduated") stores visible to some other cores before they're globally visible: some POWER CPUs have store-forwarding between logical cores of the same physical core. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. ARMv8 doesn't allow that, and no earlier ARM CPUs ever did it. x86 also requires that all threads can agree on a total order for all stores (not loads).

Upvotes: 2
