Amit

Reputation: 1130

Is there an issue with "cache coherence" in C++ multi-threading on a *Single CPU* (Multi-Core) on Windows?

(EDIT: Just to make it clear: the question about "cache coherence" is about the case where no atomic variables are used.)

Is it possible (in the single-CPU case; Windows can run on top of an Intel / AMD / Arm CPU) that thread-1, running on core-1, stores a bool variable (for example) and it stays in the L1 cache, while thread-2, running on core-n, uses that variable and sees another, stale copy of it in memory?

Code example (to demonstrate the issue, let's say that the std::atomic_bool is just a plain bool):

#include <thread>
#include <atomic>
#include <chrono>

std::atomic_bool g_exit{ false }, g_exited{ false };

using namespace std::chrono_literals;

void fn()
{
    while (!g_exit.load(std::memory_order_acquire))
    {
        // do something (lets say it takes 1-4s, repeatedly)
        std::this_thread::sleep_for(1s);
    }

    g_exited.store(true, std::memory_order_release);
}

int main()
{
    std::thread wt(fn);
    wt.detach();

    // do something (lets say it took 2s)
    std::this_thread::sleep_for(2s);

    // Exit

    g_exit.store(true, std::memory_order_release);

    for (int i = 0; i < 5; i++) { // Timeout: 5 seconds.
        std::this_thread::sleep_for(1s);
        if (g_exited.load(std::memory_order_acquire)) {
            break;
        }
    }
}

Upvotes: 2

Views: 1454

Answers (1)

Peter Cordes

Reputation: 363980

CPU cache is always coherent across the cores that we run C++ threads across (footnote 1), whether they're in the same package (a multi-core CPU) and/or spread across sockets with an interconnect. That makes it impossible to load a stale value once the writing thread's store has executed and committed to cache: as part of committing, the core sends an invalidate request to all other caches in the system.

Other threads can always eventually see your updates to std::atomic vars, even with mo_relaxed. That's the entire point; std::atomic would be useless if it didn't work for this. ("Eventually" is often about 40 nanoseconds of inter-thread latency; relaxed isn't worse for this, it just doesn't stall execution of later memory operations while waiting for a store to become visible to other threads, the way seq_cst needs to on most ISAs. See Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? - no, or not significantly.)
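As a sketch (a variation of your code, not the only valid choice): for a pure stop flag like this, where no other data is published through the flag, relaxed loads and stores are already enough for eventual visibility, because coherence does the work.

#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical variant of the question's code: relaxed ordering is enough for
// a pure stop flag, because cache coherence guarantees the store eventually
// becomes visible.  acquire/release would only matter if other data had to be
// published along with the flag.
std::atomic_bool g_exit{ false };

void fn()
{
    while (!g_exit.load(std::memory_order_relaxed))
    {
        std::this_thread::sleep_for(std::chrono::seconds(1)); // simulated work
    }
}

int main()
{
    std::thread wt(fn);
    std::this_thread::sleep_for(std::chrono::seconds(2));
    g_exit.store(true, std::memory_order_relaxed); // the worker sees this soon
    wt.join();  // joined instead of detached, so main waits for the worker to exit
}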


But without std::atomic, your code would be super broken, and a classic example of MCU programming - C++ O2 optimization breaks while loop and Multithreading program stuck in optimized mode but runs normally in -O0: the compiler can assume that no other thread is writing a non-atomic var it's reading, so it can hoist the actual load out of the loop and keep the value in a thread-private CPU register. So it's not re-reading from coherent cache at all; i.e. while(!exit_now){} becomes if(!exit_now) while(1){} for a plain bool exit_now global.

Registers are thread-private and not coherent in any way, so code written with a plain int or bool can break this way even on a uniprocessor system. Context switches just save/restore registers to thread-private kernel buffers; they don't know what the code was using the registers for, so they won't ever create the effect of re-reading bool g_exit from memory into the thread's register. In fact the code might not even be re-checking a register after optimizing while(!non_atomic_flag){} into if(!non_atomic_flag) while(42){}.

(Except that your sleep_for call would probably prevent that optimization. It's probably not declared pure, because you don't want compilers to optimize out multiple calls to it; time is the side-effect. So the compiler has to assume that calls to it could modify global vars, and thus has to re-read the global var from memory, with normal load instructions that go through coherent cache.)
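For illustration, a minimal sketch of the broken plain-bool pattern (hypothetical names, not from your code), assuming a loop body with no opaque calls such as sleep_for; the comments show the transformation the optimizer is allowed to make:

// Hypothetical sketch of the broken pattern: a plain (non-atomic) flag and a
// loop body with no opaque calls, so nothing forces the compiler to reload it.
bool exit_now = false;   // plain bool: no inter-thread visibility guarantees

void spin_wait()
{
    // Built with optimization (e.g. -O2), this may compile as if it were
    //   if (!exit_now) for (;;) {}
    // because the load can be hoisted out of the loop into a register.
    while (!exit_now)
    {
        // empty body: no call the compiler must treat as modifying globals
    }
}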

Also related: Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`?


Footnote 1: C++ implementations that support std::thread only run threads across cores in the same coherency domain. In almost all systems, there is only one coherency domain that includes all cores in all sockets, but huge clusters with non-coherent shared memory between nodes are possible.

So are embedded boards with an ARM microcontroller core sharing memory but not coherent with an ARM DSP core. You wouldn't be running a single OS across both those cores, and you wouldn't consider code running on those different cores part of the same C++ program.

For more details about cache coherency, see When to use volatile with multi threading?

Upvotes: 7
