Reputation: 33669
As I understand things, for performance on NUMA systems, there are two cases to avoid:

1. two threads writing to the same cache line
2. two threads on different sockets (NUMA nodes) writing to the same virtual page
A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel hyper-threading or AMD two cores per module). Let me borrow the diagram from OpenMP: for schedule
| socket 0 | core 0 | thread 0 |
|          | core 1 | thread 1 |
| socket 1 | core 2 | thread 2 |
|          | core 3 | thread 3 |
So, based on case 1, it's best to avoid e.g. thread 0 and thread 1 writing to the same cache line, and based on case 2, it's best to avoid e.g. thread 0 writing to the same virtual page as thread 2.
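To make case 1 concrete, here is the kind of toy benchmark I have in mind (my own sketch; the names and sizes are just illustrative): all four threads increment counters that share one 64-byte cache line, and then counters padded so each thread owns its own line. On the machine above I would expect the padded version to be substantially faster.

```c
/* Toy false-sharing sketch (illustrative only).
   Compile with: gcc -O2 -fopenmp false_sharing.c */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    50000000UL

/* Case 1 triggered: all four 8-byte counters sit in one 64-byte cache line. */
static volatile unsigned long same_line[NTHREADS] __attribute__((aligned(64)));

/* Case 1 avoided: each counter padded out to a full cache line of its own. */
struct padded_counter {
    volatile unsigned long v;
    char pad[64 - sizeof(unsigned long)];
};
static struct padded_counter own_line[NTHREADS] __attribute__((aligned(64)));

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (unsigned long i = 0; i < ITERS; ++i)
            same_line[id]++;      /* the single line ping-pongs between cores */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (unsigned long i = 0; i < ITERS; ++i)
            own_line[id].v++;     /* each core keeps its own line modified */
    }
    double t2 = omp_get_wtime();

    printf("shared line: %.2f s, padded: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}
```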
However, I have been informed that on modern processors the second case is no longer a concern: threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).
Is case two no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?
Upvotes: 6
Views: 649
Reputation: 1755
You're right about case 1. Some more details about case 2:
Based on the operating system's NUMA policy and any related page migration, the physical location of the page that threads 0 and 2 are writing to could be socket 0 or socket 1. The cases are symmetrical, so let's say there's a first-touch policy and that thread 0 gets there first. The sequence of operations could be:

1. Thread 0 touches the page first, so the OS backs it with a physical page in socket 0's memory.
2. Thread 0 writes to its cache line; the line is filled from socket 0's local memory and sits in a modified state in socket 0's cache.
3. Thread 2 writes to a different cache line of the same page; that line has to come across the interconnect from socket 0's memory before it sits in a modified state in socket 1's cache.
You could swap the order of 2. and 3. without affecting the outcome. Either way, the round trip between sockets in step 3 is going to take longer than the socket-local access in step 2, but that cost is only incurred once for each time thread 2 needs to put its line into a modified state. If execution continues for long enough between changes to that cache line's state, the extra cost will amortize.
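If you want the placement in step 1 to come out in your favour, the usual trick is to let each thread first-touch the data it will later work on. A rough sketch, assuming the default Linux first-touch policy and threads pinned to cores (e.g. OMP_PROC_BIND=true, OMP_PLACES=cores); the names and sizes are just illustrative:

```c
/* First-touch placement sketch (illustrative only).
   Assumes the default Linux first-touch policy and pinned OpenMP threads.
   Compile with: gcc -O2 -fopenmp first_touch.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)   /* 16 Mi doubles = 128 MiB, spans many pages */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* Serial initialization would fault every page in from thread 0, so all
       pages would land in socket 0's memory:
       for (long i = 0; i < N; ++i) a[i] = 0.0;                          */

    /* Parallel first touch: each thread faults in the chunk it will later
       work on, so those pages are allocated on that thread's own socket.  */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 0.0;

    /* Same static schedule in the compute loop, so each thread mostly hits
       pages that are local to its socket.                                 */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; ++i)
        sum += 2.0 * a[i];

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```

With the serial initialization instead, every page would be owned by socket 0, and the threads on socket 1 would pay the remote-access cost on every cache miss rather than just the first one.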
Upvotes: 2