Alex Guteniev

Reputation: 13719

How does the C++ Standard prevent deadlock in a spinlock mutex built with memory_order_acquire and memory_order_release?

TL;DR: if a mutex implementation uses acquire and release operations, could an implementation perform compile-time reordering, as would normally be allowed, and overlap two critical sections that should be independent because they use different locks? That could lead to deadlock.


Assume a mutex is implemented on top of std::atomic_flag:

struct mutex
{
   void lock() 
   {
       while (flag.test_and_set(std::memory_order_acquire)) 
       {
          yield_execution();
       }
   }

   void unlock()
   {
       flag.clear(std::memory_order_release);
   }

   // named flag rather than lock: a data member cannot share
   // the name of the lock() member function
   std::atomic_flag flag; // = ATOMIC_FLAG_INIT in pre-C++20
};

So far this looks OK as far as a single such mutex is concerned: the std::memory_order_release store synchronizes with the std::memory_order_acquire load.

The use of std::memory_order_acquire/std::memory_order_release here should not raise questions at first sight. It is similar to the cppreference example: https://en.cppreference.com/w/cpp/atomic/atomic_flag

Now there are two mutexes guarding different variables, and two threads accessing them in different order:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();
    m1.unlock();

    m2.lock();
    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();
    m2.unlock();

    m1.lock();
    v1.use();
    m1.unlock();
}

Release operations can be reordered past a later, unrelated acquire operation (unrelated meaning a later operation on a different object), so the execution could be transformed as follows:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();

    m2.lock();
    m1.unlock();

    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();

    m1.lock();
    m2.unlock();

    v1.use();
    m1.unlock();
}

So it looks like there is a deadlock.

Questions:

  1. How does the Standard prevent mutexes like this from deadlocking?
  2. What is the best way to implement a spinlock mutex that does not suffer from this issue?
  3. Is the unmodified mutex from the top of this post usable in some cases?

(Not a duplicate of C++11 memory_order_acquire and memory_order_release semantics?, though it is in the same area)

Upvotes: 8

Views: 919

Answers (2)

Sudoplatov

Reputation: 31

You can try CppMem (help page) to test things like that.
CppMem is a tool by the authors of Mathematizing C++ Concurrency.
To prove that some behavior of a program is allowed by the C++ memory model, one has to find an execution of the program in which none of the memory model's many rules is violated.
CppMem can do it automatically.

I simplified your program to this (to reduce the number of executions that CppMem has to check):

// RA mutex deadlock
int main() {
  atomic_int m1 = 0;
  atomic_int m2 = 0;
  
  {{{ { 
        // m1.lock
        atomic_compare_exchange_strong_explicit(m1, 0, 1, memory_order_acquire, memory_order_acquire);
        
        // m1.unlock
        m1.store(0,memory_order_release);
        
        // m2 is busy
        m2.load(memory_order_acquire).readsvalue(1);
      }
  ||| { 
        // m2.lock
        atomic_compare_exchange_strong_explicit(m2, 0, 1, memory_order_acquire, memory_order_acquire);
        
        // m1 is busy
        m1.load(memory_order_acquire).readsvalue(1);
      }
  }}}
  return 0;
}

CppMem finds an execution where both threads are waiting for mutexes:

[image: the execution graph]

In this execution, both threads are simultaneously trying to acquire a lock held by the other thread.
That looks like a deadlock, but fortunately it cannot last forever, because the standard also guarantees the following:

  • 6.9.2.3 [intro.progress]:

    18 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

  • 33.5.4 [atomics.order]:

    11 Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

These quotes mean that the write m1.store(0, memory_order_release) (the last write in the modification order of m1) must eventually become visible to the m1.load(memory_order_acquire), after which the threads can make progress.

Upvotes: 3

Peter Cordes

Reputation: 365537

There's no problem in the ISO C++ standard; it doesn't distinguish compile-time vs. run-time reordering, and the code still has to execute as if it ran in source order on the C++ abstract machine. So the effects of m2.test_and_set(std::memory_order_acquire) trying to take the 2nd lock can become visible to other threads while the thread still holds the first lock (i.e. before m1.unlock()), but failure there can't prevent m1 from ever being released.

The only way we'd have a problem is if compile-time reordering nailed down that order into asm for some machine, such that the m2 lock retry loop had to exit before actually releasing m1.

Also, ISO C++ only defines ordering in terms of synchronizes-with and what can see what, not in terms of reordering operations into some new order. That would imply such an order existed. No order that multiple threads can agree on is even guaranteed to exist across separate objects, unless you use seq_cst operations. (A modification order for each object separately is guaranteed to exist, though.)

The 1-way-barrier model of acquire and release operations (like the diagram in https://preshing.com/20120913/acquire-and-release-semantics) is a convenient way to think about things, and matches reality for pure-loads and pure-stores on x86 and AArch64 for example. But as far as language-lawyering, it's not how the ISO C++ standard defines things.


You're reordering a whole retry loop, not just a single acquire

Reordering an atomic operation across a long-running loop is a theoretical problem the C++ standard allows. P0062R1: When should compilers optimize atomics? points out that delaying a store until after a long-running loop is technically allowed by the standard's wording in 1.10p28:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

But a potentially infinite loop would violate that ("finite" is not satisfied in the deadlock case, for example), so compilers must not do it.

It's not "just" a quality-of-implementation issue. A successful mutex lock is an acquire operation, but you should not look at the retry loop as a single acquire operation. Any sane compiler won't.

(The classic example of something that aggressive atomics optimization could break is a progress bar, where the compiler sinks all the relaxed stores out of a loop and then folds all the dead stores into one final store of 100%. See also this Q&A - current compilers don't, and basically treat atomic as volatile atomic until C++ solves the problem of giving programmers a way to let the compiler know when atomics can/can't be optimized safely.)

Upvotes: 8
