RubarbPie
RubarbPie

Reputation: 131

On multicore x86 systems, are mutexes implemented using a LOCK'd instruction?

In x86 assembly there is a LOCK prefix that can be applied to an instruction to make it atomic. Is this atomicity across all cores? What is the usual delay involved? For a regular mutex, what instruction is it that's locked?

Thanks. PS: I was taught that on systems that lack such an instruction that mutexes can still be done but it is more laborious. I wonder if anyone does it that way any more.

Upvotes: 13

Views: 4167

Answers (2)

Brendan
Brendan

Reputation: 37232

In x86 assembly there is a LOCK prefix that can be applied to an instruction to make it atomic. Is this atomicity across all cores?

Yes.

What is the usual delay involved?

The cost varies depending on multiple factors (CPU model, CPU speed, bus speed, RAM speed, where the data actually is at the time and what else is trying to use the bus). For extremely old CPUs (8086, 80186, 80286, 80386) there were no caches, and the LOCK locked the bus to ensure that nothing else can interfere. This wouldn't have costed much more than the same instruction without the LOCK, except that everything else trying to use the for the duration of the instruction had to wait (it slowed down everything else more than it slowed down the code itself).

For all modern CPUs (I'm guessing "Pentium III or later") it was changed to rely on the MESI cache coherency protocol where possible. Specifically, the cache line is brought into the cache and changed to the "exclusive" state, then the instruction is done using the local cache alone without locking the bus. The cost of this depends on where the data is - e.g. if it's already in the same CPU's L1 data cache (and not in other CPU's caches) then the LOCK can cost nothing. However, (due to the nature of multi-CPU code) often the cache line is in a different CPU's cache and needs to be transferred from one cache to another and/or invalidated in other CPU's caches (via. something called a "read for ownership" in the MESI cache coherency protocol) so it costs more (but still not as much as locking a bus).

For a regular mutex, what instruction is it that's locked?

Over the years I've seen mutexes implemented about 20 different ways. There's no single instruction that is the same in all of the different implementations.

Note that when you can't acquire a mutex the kernel is told not to give your task any CPU time until the mutex is released; and then when the mutex is released the kernel needs to be told that the task can consume CPU time again. This has race conditions and ends up being an atomic "check if mutex can be acquired and then change a task's state" deep in the kernel's scheduler.

Also note that this is considerably expensive so most implementations try to optimistically do as much as possible in user-space; so that the kernel needn't be involved when the mutex is being acquired if there's no contention; and so that if nothing was blocked waiting for the mutex then the kernel isn't told to unblock waiting tasks that don't exist. There are also variations that spin for a short while if the mutex is contended to increase the chance that the kernel isn't involved.

In other words; the code to acquire and release a mutex typically isn't even in one place - it's two pieces, with one piece is in user-space and another piece is in the kernel.

Upvotes: 5

Jesus Ramos
Jesus Ramos

Reputation: 23268

On x86 lock prefix locks all cores and allows atomicity. For implementing this on other systems without LOCK, CMPXCHG loops are used or tight loops with retry logic that attempt to set the value of something to the expected value. As you can see the 2nd way is more detrimental in most cases as it's just constantly looping trying to set the value (and keeps doing so until it's done). For x86 the delay is minimal and might range from halting the instruction pipeline or flushing it and then executing that instruction atomically (typically a couple of cycles), the 2nd method can't really be estimated as it depends how much contention there is for the value that needs to be modified atomically. For mutex's I believe that it's a flag value that must be acquired (check that the mutex isn't acquired and continuously wait until the mutex is up for grabs then attempt to atomically change a flag to acquire it).

AFAIK IBM processors use the 2nd method as when working on the linux kernel I found that the atomic increment function uses it (maybe it's only for older processors). The x86 platform still uses

lock addl ...;

I will admit though that it's been about a year since I worked in this part of the kernel so I could be wrong on some points.

Upvotes: 10

Related Questions