Reputation: 5081
I'm currently looking into optimizing NUMA locality of my application.
So far I think I understand that memory will be resident on the NUMA node whose CPU first touches it after allocation.
My questions regarding std::vector (using the default allocator) are:
And about NUMA in general:
If memory that has already been touched is paged out to disk and is then accessed again, generating a hard fault, does that count as a new first touch, or is the page loaded back into memory on the NUMA node that originally touched it first?
I'm using C++11 threads. As long as I'm inside a thread and allocating/touching new memory, can I be sure that all of this memory will be resident on the same NUMA node, or is it possible that the OS moves my thread to a different CPU while it executes, so that some of my allocations end up in one NUMA domain and others in another?
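For reference, here is a minimal sketch of what I mean (Linux/glibc assumed; the core number and buffer size are just placeholders): pin the thread to one core so the first touch of every page happens from a known CPU, and therefore from a known NUMA node.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

int main() {
    std::thread worker([] {
        // Bind this thread to CPU 0 so the scheduler cannot move it between
        // the allocation and the first touch.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // The default allocator typically only obtains address space here;
        // physical pages are usually assigned when first written.
        // Value-initialising the elements writes every page, so this
        // construction is the first touch, and under a first-touch policy
        // the pages should end up on the NUMA node that CPU 0 belongs to.
        std::vector<char> buf(64 * 1024 * 1024, 0);

        // ... work on buf from this (pinned) thread ...
        (void)buf;
    });
    worker.join();
}
```

(Compile with something like `g++ -std=c++11 -pthread`.)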
Upvotes: 1
Views: 325
Reputation: 8424
Assuming we're talking about Intel CPUs: on their Nehalem-vintage CPUs, if you had two such CPUs, there was a power-on option for telling them how to divide up physical memory between them. The physical architecture is two CPUs connected by QPI, with each CPU controlling its own set of memory DIMMs. The options are,
the first half of the physical address space on one CPU and the second half on the other, or
memory pages interleaved (alternating) between the two CPUs
For the first option, if you allocated a piece of memory it'd be down to the OS where in the physical address space it took that from, and then I suppose a good scheduler would endeavour to run threads accessing that memory on the CPU that controls it. For the second option, if you allocated several pages of memory they'd be split between the two physical CPUs, and then it wouldn't really matter what the scheduler did with the threads accessing them. I actually played around with this briefly and couldn't really spot the difference; Intel had done a good job on the QPI. I'm less familiar with newer Intel architectures, but I'm assuming it's more of the same.
The other question really is what you mean by a NUMA node. If we are referring to modern Intel and AMD CPUs, these present a synthesized SMP environment to software, using things like QPI / Hypertransport (and now their modern equivalents) to do so on top of a NUMA hardware architecture. So when talking about NUMA locality, it's really a case of whether or not the OS scheduler will run a thread on a core of the CPU that controls the RAM the thread is accessing (SMP meaning that the thread can run on any core and still access the memory, though perhaps with slight latency differences, no matter where in physical memory it was allocated). I don't know the answer, but I think some schedulers will do that. Certainly the endeavours I've made to use core affinity for threads and memory have yielded only a tiny improvement over just letting the OS (Linux 2.6) do its thing. And the cache systems on modern CPUs, and their interaction with inter-CPU interconnects like QPI, are very clever.
Older OSes dating back to when SMP really was pure hardware SMP wouldn't know to do that.
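If you do want to experiment with that kind of explicit placement yourself, a rough sketch of thread-plus-memory pinning on Linux with libnuma (node number and size are illustrative; link with `-lnuma`) looks something like this:

```cpp
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    const int node = 0;            // illustrative node choice
    numa_run_on_node(node);        // restrict this thread to node 0's CPUs

    const size_t size = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(size, node);  // pages bound to node 0
    if (buf) {
        std::memset(buf, 0, size);              // touch the pages
        numa_free(buf, size);
    }
    return 0;
}
```

In my experience the gain over letting the scheduler and first-touch policy handle it was small, for the reasons above, but it's the straightforward way to make the placement explicit.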
Small rabbit hole - if we are referring to a pure NUMA system (Transputers, or the Cell processor out of the PS3 with its SPEs), then a thread would run on a specific core and would only be able to access that core's memory; to access data allocated (by another thread) in another core's memory, the software has to sort that out itself by sending the data across some interconnect. This is much harder to code for until you've learned how, but the results can be impressively fast. It took Intel about 10 years to match the Cell processor for raw maths speed.
Upvotes: 3