Dean Seo

Reputation: 5733

Why is std::memory_order_relaxed preferred in a CAS loop on failure?

When it comes to implementing a CAS loop using std::atomic, cppreference gives the following example for push:

template<typename T>
class stack
{
    std::atomic<node<T>*> head;
 public:
    void push(const T& data)
    {
      node<T>* new_node = new node<T>(data);
      new_node->next = head.load(std::memory_order_relaxed);

      while(!head.compare_exchange_weak(new_node->next, new_node,
                                        std::memory_order_release,
                                        std::memory_order_relaxed /* Eh? */));
    }
};

Now, I don't understand why std::memory_order_relaxed is used for the failure case. As far as I understand, compare_exchange_weak (the same applies to -strong, but I'll use the weak version for convenience) performs a load on failure. That load may observe a value stored by a successful CAS in another thread with std::memory_order_release, so shouldn't it use std::memory_order_acquire to be synchronized-with that store instead...?

while(!head.compare_exchange_weak(new_node->next, new_node,
                                  std::memory_order_release,
                                  std::memory_order_acquire /* There you go! */));

What if, hypothetically, the 'relaxed load' reads one of the old values, so the CAS keeps failing and the loop runs for extra time?

The following rough sketch shows where my brain is stuck:


Shouldn't a store from T2 be visible at T1? (by having synchronized-with relation with each other)

So to sum up my questions:

- Why std::memory_order_relaxed, instead of std::memory_order_acquire, at failure?
- Does std::memory_order_relaxed at failure mean (potentially) more looping?

Upvotes: 6

Views: 1580

Answers (2)

dened

Reputation: 4310

Stricter memory orders are used to prevent data races; they shouldn't be expected to improve performance in an already correct program. In the example you provided, replacing memory_order_relaxed with memory_order_acquire wouldn't fix any data race and could only decrease performance.

Why is there no data race? Because the while loop works only with a single atomic variable, and operations on a single atomic are always data-race-free regardless of the memory order used.

Why, then, is memory_order_release used in the success case? It is not shown in the example, but it is assumed that reads of head use memory_order_acquire, e.g.:

T* stack::top() {
  // acquire pairs with the release store in push()
  auto h = head.load(std::memory_order_acquire);
  return h ? &h->value() : nullptr;
}

This release-acquire sequence creates a synchronized-with relationship between the release of a new head and its acquisition by another thread.

Thread A                         Thread B
st.push(42);
                                 if (auto value = st.top()) {
                                   assert(*value == 42);
                                 }

In the above example, without the release-acquire pairing (if memory_order_relaxed were used instead), the assertion could fail: Thread B could see an incompletely initialized node that head already points to (the compiler could even reorder the node constructor call to after the store to head in push()). In other words, there would be a data race.

Upvotes: 3

curiousguy

Reputation: 8318

I don't understand how come std::memory_order_relaxed is used for the failure

And I don't understand how you can complain about the lack of acquire semantics on that failure branch, yet not complain about

head.load(std::memory_order_relaxed);

and then

while(!head.compare_exchange_weak(new_node->next, new_node,
                                  std::memory_order_release

neither of which is an acquire operation that could "be synchronized-with" some other operation that you don't show us. What is that other operation you care about?

If that operation is important, show it and tell us how this code depends on the "publication" (or "I'm done" signal) of that other operation.

Answer: the push function in no way depends on the publication of an "I'm done" signal by any other function, as push does not use other published data, does not read other pushed elements, etc.

Why not std::memory_order_acquire, instead of std::memory_order_relaxed at failure?

To acquire what? In other words, to observe what accomplishment?

Does std::memory_order_relaxed at failure mean (potentially) more looping?

No. The failure mode has nothing to do with memory visibility; it's a function of the CPU cache mechanism (and compare_exchange_weak is additionally allowed to fail spuriously).

EDIT:

I just saw the text in your image:

Shouldn't a store from T2 be visible at T1? (by having synchronized-with relation with each other)

Actually, you misunderstood synchronized-with: it does not affect which value of the atomic variable is read, because an atomic is by definition a primitive that can be used under concurrent access without a data race. A read of an atomic always returns a value of that variable as written by some thread (possibly the same thread). If that weren't the case, no atomic operation would be meaningful.

No memory ordering is ever needed to read a single atomic variable.

Upvotes: 0
