NoSenseEtAl

Reputation: 30028

Why does the order of mutexes passed to std::scoped_lock affect performance?

Note: this is not a practical problem (I have never locked more than two mutexes with scoped_lock); I am mostly curious why implementations of scoped_lock apparently take a relatively big performance hit when the threads lock the mutexes in different orders.

Example code below, godbolt link.

#include <mutex>
#include <thread>
#include <chrono>
#include <iostream>

std::mutex m1, m2, m3, m4, m5, m6;

int cnt =0;

void f(){
    for (int i=0; i< 500*1000; ++i){
        std::scoped_lock sl{m1, m2, m3, m4, m5, m6};
        cnt++;
    }
}

void f_unord(){
    for (int i=0; i< 500*1000; ++i){
        std::scoped_lock sl{m4, m5, m6, m1, m2, m3};
        cnt++;
    }
}


int main(){
for (int run = 0; run<4; ++run)
{
    {
        const auto start = std::chrono::steady_clock::now();
        std::thread t1(f), t2(f);
        t1.join();
        t2.join();
        const auto end = std::chrono::steady_clock::now();
        std::cout << "same lock order: " << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() << std::endl; 
        std::cout << cnt << std::endl;
    }
    {
        const auto start = std::chrono::steady_clock::now();
        std::thread t1(f), t2(f_unord);
        t1.join();
        t2.join();
        const auto end = std::chrono::steady_clock::now();
        std::cout << "different lock order: " << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() << std::endl; 
        std::cout << cnt << std::endl;
    }
}
}

note on why this is surprising: since mutexes are not movable, I would expect implementations to simply sort the mutexes by address and lock them in that order.

note on benchmarking on godbolt: I know godbolt is not reliable for benchmarking, but I get similar results on my machine in a VM:

g++ --version; g++ -O2 -std=c++17 scoped_lock.cpp -pthread; ./a.out

g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

different lock order: 1074
1000000
same lock order: 602
2000000
different lock order: 987
3000000
same lock order: 612
4000000
different lock order: 1012
5000000
same lock order: 585
6000000
different lock order: 1050
7000000
same lock order: 675
8000000
different lock order: 1107
9000000
same lock order: 609
10000000

Upvotes: 2

Views: 670

Answers (3)

nop666

Reputation: 603

It is tied to the implementation. Typical implementations of std::scoped_lock simply delegate to std::lock.

The std::lock documentation says:

The objects are locked by an unspecified series of calls to lock, try_lock, and unlock. If a call to lock or unlock results in an exception, unlock is called for any locked objects before rethrowing

gcc's implementation of std::lock is:

template<typename _L1, typename _L2, typename... _L3>
  void
  lock(_L1& __l1, _L2& __l2, _L3&... __l3)
  {
    while (true)
      {
        using __try_locker = __try_lock_impl<0, sizeof...(_L3) != 0>;
        unique_lock<_L1> __first(__l1);
        int __idx;
        auto __locks = std::tie(__l2, __l3...);
        __try_locker::__do_try_lock(__locks, __idx);
        if (__idx == -1)
          {
            __first.release();
            return;
          }
      }
  }

As you can see, when both threads use the same declaration order, the behavior is simple:

  • t1's std::lock takes ownership of m1
  • t2's std::lock waits for m1 to be freed

In the second case, each thread locks its own first mutex, fails a try_lock on a mutex the other thread already holds, unlocks everything, and retries. It can take some time for this to stabilize (in theory it could even livelock forever)...

Upvotes: 0

Howard Hinnant

Reputation: 218750

As others have stated, it is tied to the implementation. But the implementation can do better than gcc's, which persistently tries the same thing over and over again.

Persistent

Lock the first lock and then try_lock the rest. If any of the try_locks fail, unlock everything and try again.

This algorithm works best if both threads list their mutexes in the same order.

For a higher performing and more robust algorithm, the implementation should use what this paper calls Smart & Polite.

Smart & Polite

Lock the first lock and then try_lock the rest. If any of the try_locks fail, unlock everything, then yield, then retry except the first lock is done on the one that previously failed the try_lock.

The paper shows that this algorithm never performs worse than any other, and often performs much better. This analysis includes the more traditional algorithm that sorts the lockables into a global order, and then locks them in that order (therein labeled Ordered).

libc++ and Visual Studio both use Smart & Polite. gcc's libstdc++ uses Persistent.

When using clang on non-Apple platforms, use -stdlib=libc++ to choose libc++ over gcc's std::lib.

Read Dining Philosophers Rebooted for an in-depth performance analysis of these algorithms for std::lock.

Upvotes: 5

Tom

Reputation: 889

When both threads use the same mutex order, no deadlock can occur: thread t2 can only proceed past m1 if thread t1 has not yet locked it, and vice versa.

When the two threads use different orders, a deadlock could occur: thread t1 has locked m1, m2, and m3 and tries to lock m4, m5, and m6, while at the same time thread t2 has locked m4, m5, and m6 and tries to lock m1, m2, and m3. Neither thread can proceed, so the potential deadlock has to be resolved.

In this case, one scoped_lock has to release the mutexes it has already acquired so that the other thread can continue. The releasing thread must then reacquire all of them, and on the next iteration the same situation can arise again; that repeated back-off and retry is where the extra time goes.

Upvotes: 2
