Z boson

Reputation: 33669

NUMA systems, virtual pages, and false sharing

As I understand things, for performance on NUMA systems, there are two cases to avoid:

  1. threads in the same socket writing to the same cache line (usually 64 bytes)
  2. threads from different sockets writing to the same virtual page (usually 4096 bytes)

A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel Hyper-Threading or AMD two-cores-per-module). Let me borrow the diagram at OpenMP: for schedule

| socket 0    | core 0 | thread 0 |
|             | core 1 | thread 1 |

| socket 1    | core 2 | thread 2 |
|             | core 3 | thread 3 |

So based on case 1 it's best to avoid e.g. thread 0 and thread 1 writing to the same cache line and based on case 2 it's best to avoid e.g. thread 0 writing to the same virtual page as thread 2.

However, I have been informed that on modern processors the second case is no longer a concern: threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).

Is case two no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?

Upvotes: 6

Views: 649

Answers (1)

Aaron Altman

Reputation: 1755

You're right about case 1. Some more details about case 2:

Based on the operating system's NUMA policy and any related migration issues, the physical location of the page that threads 0 and 2 are writing to could be socket 0 or socket 1. The cases are symmetrical so let's say that there's a first touch policy and that thread 0 gets there first. The sequence of operations could be:

  1. Thread 0 allocates the page.
  2. Thread 0 does a write to the cache line it'll be working on. That cache line transitions from invalid to modified within cache(s) on socket 0.
  3. Thread 2 does a write to the cache line it'll be working on. To put that line in exclusive state, socket 1 has to send a Read For Ownership to socket 0 and receive a response.
  4. Threads 0 and 2 can go about their business. As long as thread 0 doesn't touch thread 2's cache line or vice versa and nobody else does anything that would change the state of either line, all operations that thread 0 and thread 2 are doing are socket- (and possibly core-) local.

You could swap the order of steps 2 and 3 without affecting the outcome. Either way, the round trip between sockets in step 3 takes longer than the socket-local access in step 2, but that cost is only incurred once each time thread 2 needs to put its line into the Modified state. If execution continues for long enough between transitions of that cache line's state, the extra cost will amortize.

Upvotes: 2

Related Questions