Reputation: 2467
An atomic variable is shared among several concurrently running threads. To my knowledge, a thread can read a stale value of it when performing a relaxed load:
x.load(std::memory_order_relaxed)
What if we use release-acquire synchronization? As far as I can tell from the docs, the thread which performs the acquire is guaranteed to see what the releasing thread has written into the variable.
// Thread 1:
x.store(33, std::memory_order_release);
// Thread 2:
x.load(std::memory_order_acquire)
Will thread 2 always see the fresh value in this case?
Now if we add a third thread, which performs a relaxed store, to the previous example, thread 2 may not see that update, as the synchronization is established between threads 1 and 2 only. Am I correct?
Finally, read-modify-write operations are said to always operate on fresh values. Does it mean that they force the thread to "see" the updates made by other threads, so if we load after a read-modify-write operation, we will see the value at least as fresh as that operation did?
// Thread 1:
x.fetch_add(1, std::memory_order_relaxed); // get the fresh value and add 1
if (x.load(std::memory_order_relaxed) == 3) // not older than fetch_add
fire();
// Thread 2:
x.fetch_add(2, std::memory_order_relaxed); // will see thread 1 modification
if (x.load(std::memory_order_relaxed) == 3) // will see what fetch_add has put in x or newer
fire();
For x initially zero, can I be sure that fire will be called at least once in this case? All my tests proved that it works, but perhaps it's just a matter of my compiler or hardware.
I'm also curious about the ordering. There is a clear dependency between the x modification and the load, so I suppose these instructions cannot be reordered even though relaxed order is specified. Or am I wrong?
Upvotes: 1
Views: 736
Reputation: 364220
The key point is that memory_order_relaxed means that fetch_add doesn't order loads/stores to any other location. It does whatever is necessary to be atomic and no more. However, dependency ordering within a single thread still applies.
For fetch_add to be atomic, it must prevent any other fetch_add trying to execute at the same time from operating on the same input value it's reading. Pretty much everything uses a derivative of the MESI cache-coherence protocol, so in practice an atomic fetch_add is done by keeping the cache line "locked" in M state from the read to the write. (Or with LL/SC, detecting that this didn't happen and retrying.) See also Can num++ be atomic for 'int num'? for a more detailed description for x86.
C++ doesn't specify anything about how it's implemented, but it can be useful to have some idea of what kind of hardware operation C++ is trying to expose.
For x initially zero, can I be sure that fire will be called at least once in this case?
Yes, it will run at least once. It can run twice, since it's possible for both loads to see the result from the second fetch_add.
The loads always see a value for x that has been updated at least by the fetch_add in its own thread. memory_order_relaxed doesn't allow a thread to observe its own operations happening out of order, and the fetch_adds do happen in some order. So at least the thread that "goes second" will see x == 3.
It's not predictable what the order will be. You could observe it by looking at the return value of fetch_add, instead of using a separate load. (This code doesn't observe the order established by the fetch_adds, because multiple threads can get the same value. For that, you need to capture the value as part of the single atomic operation.)
I'm also curious about the ordering. There is a clear dependency between the x modification and the load, so I suppose these instructions cannot be reordered even though relaxed order is specified.
Memory reordering at run-time and compile-time, and out-of-order execution, all preserve the behaviour of a single thread. The golden rule of all this stuff (including the compiler's "as-if rule") is "don't break single-threaded code".
But the order in which these operations become globally visible to other threads is not guaranteed. The store part of the x.fetch_add can become globally visible to other threads after the x.load. It won't on x86, because x86 doesn't reorder stores with later loads from the same address (but StoreLoad reordering is allowed for other addresses, even when a store and load partially overlap).
A 3rd thread might see the operations of T1 and T2 as happening in a different order than T1 sees T2's operations. (i.e. there doesn't have to be a total order that all threads agree on, unless you use sequential consistency.)
Note that "observing" a load becoming globally visible is only possible indirectly: by looking at other stores done by the thread, to figure out what value must have loaded. But loads are a real and important part of ordering.
So if fire() writes something to memory, a 3rd thread could see that happen while it still sees x==0. That's only possible if it looks (to thread 3) like T1's load happened before the fetch_add in either thread. Even if thread 3 was using x.fetch_add(mo_relaxed) to observe the value in x, C++ allows this. But like I said, you won't see it on x86.
Note that "dependency ordering" is a phrase with another meaning. Even on weakly-ordered CPUs, loading a pointer and then dereferencing it guarantees that LoadLoad reordering doesn't happen in between. Only one architecture didn't do this: Alpha requires a barrier for this to work. This is one reason that memory_order_consume and memory_order_relaxed are separate.
You can use memory_order_consume instead of memory_order_acquire to more cheaply synchronize with a mo_release producer, if it's publishing a pointer. (Compilers typically just strengthen consume to acquire, because it's hard to implement.)
The mo_consume meaning of dependency ordering applies between two different memory locations: a pointer and a buffer. It's a totally different thing from the fact that a thread always sees its own actions in order.
A thread will always see data at least as new as its own most recent store, when there's only a single object. (New in the order it observes for operations, not new in absolute time! This just means that a thread can't store a value and then have its next read find that the value reverted to one from a store that had already been seen before the fetch_add.)
It's not easy to reason about this stuff (e.g. global order seen by other threads, vs. order seen by one of the threads involved). Enumerating all the surprising reorderings that might or might not be possible is also tricky.
There are some links in the lock-free and stdatomic tag wikis. I'd especially recommend Jeff Preshing's blog. He explains things well, and has posted many good articles.
Upvotes: 2