Reputation: 55
I've written a basic graph scheduler that synchronises task execution in a wait-free manner. Since the graph topology is immutable, I figured I could make all atomic operations relaxed. However, as I learned more about CPU hardware, I became concerned about how my data structure behaves on platforms with weaker memory models (I've only tested my code on x86). Here's the scenario that bothers me:
Thread 1 (T1) and Thread 2 (T2) concurrently update (non-atomically) memory locations X and Y respectively, and then proceed to execute other unrelated tasks.
Thread 3 (T3) picks up a dependent task after T1 and T2 are done, loads X and Y, and sums them. There is no acquire/release synchronisation, thread join, or lock involved, and T3's task is guaranteed to be scheduled after T1's and T2's are done.
Assuming that T1, T2, and T3 are scheduled (by the OS) on different CPU cores, my question is: In the absence of any memory fences or lock-like instructions, is T3 guaranteed to see the latest values of X and Y? Another way of asking this is: If you don't insert a fence, how long after a store can you perform a load, or are there no guarantees regarding that?
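To make the pattern concrete, here's a minimal sketch of the shape of code I have in mind (the task functions are made up for illustration; my real scheduler machinery is omitted):

```cpp
int X = 0, Y = 0;                // plain, non-atomic data

void task_t1() { X = 1; }        // runs on core 1, then does unrelated work
void task_t2() { Y = 2; }        // runs on core 2, then does unrelated work

// Scheduled only after the scheduler (which internally uses only
// relaxed atomics) believes T1 and T2 are done:
int task_t3() { return X + Y; }  // is this guaranteed to return 3?
```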
My concern is that there are no guarantees that the cores which executed T1 and T2 have flushed their store buffers by the time T3's core attempts to load that information. I tend to think of data races as data corruption caused by a load and a store (or a store and a store) happening at the same time. However, I've come to realise that I'm not quite sure what "at the same time" really means, given the distributed nature of CPUs at the micro scale. According to cppreference:
A program that has two conflicting evaluations has a data race unless:
- both evaluations execute on the same thread or in the same signal handler, or
- both conflicting evaluations are atomic operations (see std::atomic), or
- one of the conflicting evaluations happens-before another (see std::memory_order)
This seems to imply that anyone using my graph scheduler would experience a data race (assuming they don't protect against it themselves), even if I can guarantee that T3 doesn't execute until T1 and T2 are done. I have yet to observe a data race in my tests, but I'm not naive enough to think that tests alone suffice to prove this.
Upvotes: 2
Views: 178
Reputation: 363980
how long after a store can you perform a load
ISO C++ makes zero guarantees about timing. It's almost always a bad idea to rely on timing / distance for correctness.
In this case all you need is acquire/release synchronization somewhere in the scheduler itself, e.g. T1 and T2 declaring themselves done using a release-store, and the scheduler checking for that with an acquire load.
Otherwise, what does it even mean that T3 executes after T1 and T2? If the scheduler could see an "I'm done" store too early, it could start T3 while T1 or T2 hadn't yet finished all of its stores.
If you make sure that everything in T3 happens after T1 and T2 (using acquire loads that "synchronize-with" a release store from each of T1 and T2), you don't even need to use atomics in T1 and T2, just in the scheduler machinery.
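For example, here's a minimal sketch of that kind of scheduler machinery (the `Task` layout and names are my invention, not necessarily how your scheduler is structured): each predecessor finishes with a release decrement of the dependent task's pending-count, and the thread that drops it to zero gets acquire ordering, so every predecessor's plain stores are visible before the dependent task runs.

```cpp
#include <atomic>

struct Task {
    std::atomic<int> pending;   // number of unfinished predecessors
    void (*run)();              // the dependent task's body
};

void predecessor_done(Task& dependent) {
    // Release: publishes this thread's plain stores (e.g. T1's write to X).
    // Acquire (for the thread that hits zero): makes the *other*
    // predecessors' plain stores visible before dependent.run() starts.
    if (dependent.pending.fetch_sub(1, std::memory_order_acq_rel) == 1)
        dependent.run();        // last predecessor schedules/runs T3
}
```

With relaxed decrements the count itself would still be correct, but nothing would order the plain stores to X and Y before T3's loads.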
An acquire load and a release store are relatively cheap compared to seq_cst. On real HW, seq_cst has to drain the store buffer after a store; release doesn't. x86 does acq_rel for free.
(And yes, testing on x86 doesn't prove anything: the hardware memory model is basically acq_rel, so compile-time reordering picks some legal order, and then that order runs with acq_rel strength.)
I'm not sure whether starting a new thread guarantees that everything in the new thread "happens after" that point in the creating thread. If so, then this is formally safe.
If not, then in theory IRIW reordering is something to worry about. (All threads using seq_cst loads have to agree about the global order of seq_cst stores, but not with weaker memory orders. In practice, PowerPC is the hardware that can do this in real life, AFAIK, and only for short-ish windows; see Will two atomic writes to different locations in different threads always be seen in the same order by other threads?.) Any std::thread constructor would involve a system call and be long enough in practice, as well as involving barriers anyway, whether ISO C++ formally guarantees that or not.
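As a sketch of that case (assuming thread creation does synchronize-with the start of the thread function, as implementations provide in practice), plain data written before constructing the thread is then race-free to read inside it:

```cpp
#include <thread>

int data = 0;                  // plain, non-atomic

void reader() {
    // If thread creation "happens after" the store in main,
    // this read is race-free and must see 42.
    int seen = data;
    (void)seen;
}

int main() {
    data = 42;                 // sequenced before the constructor call
    std::thread t(reader);     // constructor completion synchronizes-with
                               // the start of reader() (under the assumption above)
    t.join();
}
```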
If you're not starting a new thread, but instead storing a flag for a worker to see, then acq/rel is again enough: happens-before is transitive, so A -> B and B -> C implies A -> C.
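A sketch of that flag version (flag names invented for illustration): T1 release-stores its done flag, the scheduler acquire-loads it and then release-stores a "go" flag, and T3 acquire-loads that, so happens-before chains from T1's plain stores through the scheduler to T3's loads.

```cpp
#include <atomic>

int X = 0;
std::atomic<bool> t1_done{false}, t3_go{false};

void t1() {
    X = 1;                                                // A: plain store
    t1_done.store(true, std::memory_order_release);
}

void scheduler() {
    while (!t1_done.load(std::memory_order_acquire)) {}   // A happens-before B
    t3_go.store(true, std::memory_order_release);         // B
}

void t3() {
    while (!t3_go.load(std::memory_order_acquire)) {}     // B happens-before C
    int x = X;                                            // C: guaranteed to see 1
    (void)x;
}
```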
Upvotes: 1