ledonter

Reputation: 1699

Do lock-free algorithms tend to outperform locking ones when the number of cores is large?

AFAIU there's a common idea that lock-free code often has higher overhead than locking code.

At the same time, there also seems to be an idea that lock-free algorithms are more scalable under contention.

If there are 2 cores, and 2 threads contending over something like a std::queue (+ mutex) vs. a boost::lockfree::queue (MPMC lock-free queue), locks will likely outperform lock-free (AFAIU because of the general overhead of lock-free algorithms). If I have 50 threads instead, the locking version will be really slow because of all the context switching, lock convoys, etc.
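For concreteness, the kind of comparison I have in mind is roughly this (just an illustrative sketch; capacity, iteration counts and thread counts are arbitrary):

```cpp
#include <boost/lockfree/queue.hpp>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Locking variant: a std::queue guarded by a single mutex.
std::queue<int> locked_q;
std::mutex locked_q_mutex;

void locked_push(int v) {
    std::lock_guard<std::mutex> lk(locked_q_mutex);
    locked_q.push(v);
}

// Lock-free variant: boost's MPMC queue (fixed capacity, retry when full).
boost::lockfree::queue<int> lockfree_q(1024);

void lockfree_push(int v) {
    while (!lockfree_q.push(v)) { /* retry if the queue is full */ }
}

// Spawn N producer threads hammering one of the queues (N = 2 or 50 above).
void run_producers(int n_threads, void (*push)(int)) {
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back([push] { for (int i = 0; i < 100000; ++i) push(i); });
    for (auto& th : threads) th.join();
}
```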

AFAIU in this case a lock-free version might tend to perform better than a locking one; is this roughly true?

However, having 50 threads on a 2-core machine is not a good idea for performance anyway; so I guess a more relevant question would be the same thing on a machine with at least 50 cores.

The 2-thread versions would use 2 cores at a time and I expect the outcome would be roughly the same; but what about 50 threads, one per core, contending for the same lock? The impact of locking seems to be higher than in the 2-thread version (only 1 thread out of 50 can do the job, not 1 out of 2; so utilization would not be great AFAIU). So do lock-free versions tend to outperform locking ones in this case as well (although it's still 1 out of 50)?

Roughly I'm trying to understand the concept of 'more scalable under contention'; I'm assuming that lock-free might start to outperform locking in oversubscription scenarios;

and I'm asking whether a lock-free version tends to outperform locks with an increase of HW thread count.

P.S.: I know that measuring is the most important thing; this is all just handwaving; but maybe there are some general ideas around this...

Upvotes: 0

Views: 231

Answers (1)

Jérôme Richard

Reputation: 50278

It seems there's an idea that lock-free algorithms are more scalable under contention.

This is very dependent on the use case/context, so it is not always true. One frequent example is lock-free code that uses CAS instructions, compared to a lock-based implementation. When there is some contention, threads just try to brute-force access to the cache line, which is inefficient since only one thread at a time will succeed in modifying the cache line (assuming no SMT) and since it makes contention even worse. A common mitigation is to use pause instructions, but this is far from fixing the issue. Indeed, a core running pause instructions still consumes processing resources (the frequency is generally left unchanged, and the thread stays scheduled on its hardware thread, preventing other threads from running during that time).

When the number of threads is bigger than the number of hardware threads, CAS spinning causes scheduling issues, as it can prevent other threads from progressing, causing unwanted context switches with a quite long quantum (typically 5-20 ms). Context switches are expensive, but the OS scheduler wakes up tasks only when needed and operates only on ready tasks. The scheduling overhead is fine as long as the number of ready tasks is not too big and latency is not critical (typically when there are not too many ping-pong communications, nor low-latency requirements like in real-time applications).
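For illustration, here is roughly what that contended pattern looks like (a minimal sketch, not taken from any particular library; `_mm_pause` is the x86 intrinsic, other architectures have their own hint instructions):

```cpp
#include <atomic>
#include <immintrin.h>  // _mm_pause (x86); other ISAs have equivalent hints

std::atomic<int> counter{0};

// Typical CAS retry loop: under contention, every failing thread keeps
// pulling the cache line back in exclusive state, which is exactly the
// "brute force" behaviour described above.
void increment_with_cas() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        _mm_pause();  // hint only: the hardware thread is still occupied
    }
}
```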

In practice, in addition to the use case, the performance of the approaches is very dependent on the hardware and software stacks, for example because:

  • the latency of atomic operations (and cache accesses) strongly impacts contention, not to mention that memory ordering impacts the overhead of atomic operations differently on different processor architectures (e.g. x86 vs ARM vs POWER; see the sketch after this list);
  • the target processor and its caches can significantly impact context switch overheads (e.g. recent security vulnerability mitigations severely increased such overheads);
  • the scheduling algorithm (e.g. basic round-robin vs CFS) can strongly impact context switch overheads.
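As a small illustration of the first point, the same logical operation with different memory orderings can cost very different amounts depending on the architecture (an illustrative sketch, not from any particular codebase):

```cpp
#include <atomic>

std::atomic<long> hits{0};

// Same logical increment, different ordering constraints.
// On x86-64 both typically compile to the same lock-prefixed instruction,
// whereas on ARM/POWER the seq_cst version requires stronger barrier
// semantics, so the relative cost of "an atomic op" differs per architecture.
void count_relaxed() { hits.fetch_add(1, std::memory_order_relaxed); }
void count_seq_cst() { hits.fetch_add(1, std::memory_order_seq_cst); }
```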

in this case a lock-free version might tend to perform better than a locking one; is this roughly true?

It depends on the scheduler and the exact access pattern. In practice, schedulers (like CFS on Linux) can schedule the two threads on the same core so as to reduce cache misses due to cache line invalidation (though some delayed overhead can appear, like cache misses when the threads are later executed in parallel). The granularity of the accesses can play a huge role in the overall performance.

For many fine-grained accesses requiring a ping-pong execution, the lock-based implementation will be slow because of context switches, compared to the small cost of atomic accesses in that case. For many fine-grained accesses not requiring ping-pong-like executions, the cost of a lock is basically that of an atomic access without cache line invalidations or any kind of contention (much cheaper, but still far slower than non-atomic instructions on x86-64 platforms), and tasks can be scheduled for a quite large quantum, so the overall execution can be faster than a lock-free implementation because the contention is avoided. In practice, lock-free algorithms tend to operate at a smaller granularity, which impacts overheads (e.g. a lock around a whole function vs many atomic operations).
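To make the granularity point concrete, here is a rough sketch (the StatsLocked/StatsLockFree types are made up for illustration): one mutex acquisition covering a whole update function versus one atomic operation per field.

```cpp
#include <atomic>
#include <mutex>

// Coarse granularity: one lock/unlock pair covers all three updates.
struct StatsLocked {
    std::mutex m;
    long count = 0, sum = 0, max = 0;

    void record(long v) {
        std::lock_guard<std::mutex> lk(m);
        ++count;
        sum += v;
        if (v > max) max = v;
    }
};

// Fine granularity: three separate atomic RMWs, each a potential
// cache-line invalidation under contention.
struct StatsLockFree {
    std::atomic<long> count{0}, sum{0}, max{0};

    void record(long v) {
        count.fetch_add(1, std::memory_order_relaxed);
        sum.fetch_add(v, std::memory_order_relaxed);
        long cur = max.load(std::memory_order_relaxed);
        while (v > cur &&
               !max.compare_exchange_weak(cur, v, std::memory_order_relaxed)) {
            // cur is reloaded by compare_exchange_weak on failure
        }
    }
};
```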

For the 50-thread use case, I agree with the provided explanation, since the overhead of CAS should be small with two cores compared to the overhead of context switches, assuming they happen frequently (see above).

what about 50 threads contending for the same lock?

This sentence is very confusing to me: wasn't that already supposed to be the case in the previous example (i.e. 1 queue with one lock on it)?

Roughly I'm trying to understand the concept of 'more scalable under contention'; I'm assuming that lock-free might start to outperform locking in oversubscription scenarios;

I think the key point to keep in mind is that contention can make threads executing heavily atomic-based lock-free code run >N times slower on an N-core system, while a lock-based code is, in the worst case, executed serially. The lock-based code is faster in that case even though both scale poorly.

It is a bit like 2 developers trying to work on the same file at the same time versus one waiting for the other to complete their task before modifying the code. In practice, the second approach is often better if the same sections of the file are modified simultaneously by the developers (without any scheduling ahead of time so as to reduce conflicts). Having many threads and few cores is like having many tasks and few developers. Good task scheduling can avoid developers working simultaneously on the same code sections, resulting in good scalability, assuming the tasks do not need to be suspended often (which would require long coffee breaks and some time to recall the previous task).

and I'm asking whether a lock-free version tends to outperform locks with an increase of HW thread count

Atomic operations can be significantly faster for ping-pong accesses between threads executed on hardware threads belonging to the same core. This is because the cache line is already available in the L1 cache shared by the two hardware threads on mainstream processors. Still, atomic operations are far from cheap even in that case on most processors (because of the delays needed to ensure the consistency of atomic operations between cores).

Upvotes: 3
