Reputation: 45771
I read the spin lock code from here, especially this part:
inline void Enter(void)
{
    int prev_s;
    do
    {
        prev_s = TestAndSet(&m_s, 0);
        if (m_s == 0 && prev_s == 1)
        {
            break;
        }
        // relinquish current timeslice (can only
        // be used when OS available and
        // we do NOT want to 'spin')
        // HWSleep(0);
    }
    while (true);
}
Why do we need to test two conditions, m_s == 0 && prev_s == 1? I think just testing prev_s == 1 should be enough. Any ideas?
EDIT: version 2. Should we fix it this way, if there is a bug?
inline void Enter(void)
{
    do
    {
        if (m_s == 0 && 1 == TestAndSet(&m_s, 0))
        {
            break;
        }
        // relinquish current timeslice (can only
        // be used when OS available and
        // we do NOT want to 'spin')
        // HWSleep(0);
    }
    while (true);
}
EDIT: version 3. I think version 3 is correct at the functional level, but its performance is not good enough, since every pass does a write with no read test ahead of it. Is my understanding correct?
inline void Enter(void)
{
    do
    {
        if (1 == TestAndSet(&m_s, 0))
        {
            break;
        }
        // relinquish current timeslice (can only
        // be used when OS available and
        // we do NOT want to 'spin')
        // HWSleep(0);
    }
    while (true);
}
@dragonfly, here is my bug-fix version 4 (it fixes the bug in version 2 that you pointed out). Could you review whether it is correct, please? Thanks!
EDIT: version 4.
inline void Enter(void)
{
    do
    {
        if (m_s == 1 && 1 == TestAndSet(&m_s, 0))
        {
            break;
        }
        // relinquish current timeslice (can only
        // be used when OS available and
        // we do NOT want to 'spin')
        // HWSleep(0);
    }
    while (true);
}
Upvotes: 3
Views: 2173
Reputation: 8260
None of these examples is necessarily correct on hardware that has less strict memory ordering. PowerPC and IA64 are two such examples: isync and .acq operations are required on the test-and-set operations that acquire the lock (and similarly lwsync and .rel operations in the exit).
Upvotes: 0
Reputation: 47762
All your versions are correct except 2. Also, your remarks about the check for m_s == 0 in version 1 and the reduced performance in version 3 are correct.
The cause of the reduced performance is the way T&S is implemented, particularly that it performs a write on every call. A write (even one that doesn't actually change the data of m_s), or the intention to write, causes its cache line to be invalidated on the other CPUs, which means that when another CPU (also waiting for the same lock) tests m_s, it cannot read the data from its cache but has to fetch it from the CPU that previously owned it. This phenomenon is called cache ping-pong, after its resemblance to the ping-pong game, where the ball (in this case the cache line's data) constantly moves between the players (the CPUs). If you add the extra read test before the T&S, the CPUs all just read, which means they can all have the data in their caches (i.e. share it) until somebody writes to it.
This happens in versions 1 and 3, because they run T&S on every iteration of the loop.
Note that the remarks about the extra check protecting against another thread incorrectly freeing the lock are misleading: not because the other thread must not do it, but because such a check doesn't guard against this possibility, even remotely. Imagine that the other thread does it, but just after the locking thread performs the check. If you really want to protect yourself against this kind of breakage, you should add another variable holding, e.g., the ID of the thread holding the lock, and verify correct manipulation of the lock against it.
Another issue is that on some architectures you need memory barriers to ensure good memory ordering: particularly, ensuring that the m_s test actually reads the value every time (some compiler barrier should be sufficient here), and that any read (and maybe write, if you wish) that occurs inside the critical section doesn't "leak out", that is, isn't performed by the CPU before the actual lock is taken. Unlock has to be handled similarly. Note that Jon Skeet's version is correct in this respect, because he uses Java (or C#? I'm not sure, but their semantics should be similar) volatile.
Upvotes: 1
Reputation: 43296
Why both conditions? Because a second thread will also get the lock in that case. (Edit: but that case can't happen if all threads follow the spinlock protocol.)
If the lock is available (signaled), m_s has the value 1. When taken by some thread, it has the value 0. No other values are allowed.
Consider one thread that wants the lock; whether or not it is signaled at the moment that thread calls Enter() is unimportant. The thread is allowed to take the lock if m_s is 1, and it takes it by changing it to 0. The first iteration where that happens causes the loop to exit and the thread to hold the lock.
Now consider two threads that want the same lock. Both are calling TestAndSet(), waiting to see the value 1 become 0. Because TestAndSet() is atomic, only one of the waiting threads ever gets to see the value 1. All the other threads only ever see m_s as 0, and must continue to wait.
The condition where m_s is 1 after this thread set it to 0 implies that some other thread signaled the lock between the atomic operation and the check. Since only one thread at a time is supposed to hold the lock, it seems like that can't happen.
I'm guessing that this is an attempt to satisfy the invariant promise of the spinlock. (Edit: I'm no longer so sure; more below...) If the lock is held, the value of m_s must be zero. If not, it is one. If setting it to zero didn't "stick", then something weird is happening, and it is best not to assume this thread now holds the lock when the invariant is not true.
Edit: Jon Skeet points out that this case might be a flaw in the original implementation. I suspect he's right.
The race condition that is being guarded is against a thread that doesn't have the right to signal the spinlock, signaling the spinlock anyway. If you can't trust callers to follow the rules, spinlocks probably aren't the synchronization method of choice, after all.
Edit 2: The proposed revision looks much better. It would clearly avoid the multi-core cache coherency interaction that the original had due to always writing the sentinel m_s.
After reading about the TATAS protocol (you can learn something new every day, if you pay attention...) and the multicore cache coherency issue it is addressing, it is clear to me that the original code was trying to do something similar, but without understanding the subtlety behind it. It would indeed have been safe (assuming all callers follow the rules) to drop the redundant check on m_s as it was written. But the code would have been writing to m_s on every loop iteration, and that would have played havoc on a real multicore chip with a cache per core.
The new spinlock is still vulnerable to a second thread releasing it without holding it. There is no way to repair that. My earlier claim about trusting the callers to follow protocol still applies.
Upvotes: 2
Reputation: 1600
Actually, someone may call TestAndSet(&m_s, 1), i.e. Leave(), from another thread just after TestAndSet(&m_s, 0) and before the if test in Enter(). In that case the lock will not be obtained, and m_s will not be equal to 0. Therefore such a check is needed.
Upvotes: 1
Reputation: 1500225
It looks to me like this is an attempt at optimisation gone slightly wrong. I suspect it's trying for "TATAS" - "Test-and-test-and-set", where it doesn't even try to do the TestAndSet if it can see the lock is already taken.
In a post about spin locks for .NET, Joe Duffy writes this TATAS code as:
class SpinLock {
    private volatile int m_taken;

    public void Enter() {
        while (true) {
            if (m_taken == 0 &&
                Interlocked.Exchange(ref m_taken, 1) == 0)
                break;
        }
    }

    public void Exit() {
        m_taken = 0;
    }
}
(Note that Joe is using 1 to mean locked and 0 to mean unlocked, unlike the code project sample - either is fine, just don't get confused between the two!)
Note that here the call to Interlocked.Exchange is conditional on m_taken being 0. This reduces contention: the relatively expensive (I'd guess) test-and-set operation is avoided where it's unnecessary. I suspect that's what the author was aiming at, but didn't quite get right.
This is also mentioned in the Wikipedia article about spinlocks under "significant optimisations":
To reduce inter-CPU bus traffic, when the lock is not acquired, the code should loop reading without trying to write anything, until it reads a changed value. Because of MESI caching protocols, this causes the cache line for the lock to become "Shared"; then there is remarkably no bus traffic while a CPU is waiting for the lock. This optimization is effective on all CPU architectures that have a cache per CPU, because MESI is so ubiquitous.
That "loop reading" is exactly what the while loop does - until it sees m_taken
change, it only reads. When it sees the change (i.e. when the lock is released) it has another go at locking.
Of course it's very possible that I'm missing something important - issues like this are very subtle.
Upvotes: 7