Reputation: 797
I mean, in the implementation of an OS, what mechanism does this job, e.g. in the Linux kernel? Or, put another way: as we all know, there are tools that let us do this conveniently, like the Windows Task Manager, but what is the internal mechanism?
Upvotes: 1
Views: 308
Reputation: 8444
Quick Answer
In Linux it's sched_setaffinity() or pthread_setaffinity_np() that you need to look at.
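For illustration, here's a minimal sketch (my own, not anything lifted from the kernel) that pins the calling thread to core 0 with pthread_setaffinity_np(); sched_setaffinity() does the equivalent for a whole process, which is roughly what Task Manager's "Set affinity" option does on Windows. The choice of core 0 is arbitrary; compile with gcc pin.c -pthread.

```c
#define _GNU_SOURCE          /* needed for the affinity calls and sched_getcpu() */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);          /* start with an empty CPU mask          */
    CPU_SET(0, &set);        /* allow only core 0 (an arbitrary pick) */

    /* Pin the calling thread. Note this function returns an error
       number directly rather than setting errno. */
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        return 1;
    }

    printf("now running on core %d\n", sched_getcpu());
    return 0;
}
```

As for the internal mechanism the question asks about: the kernel simply records the mask against the task in its task_struct, and the scheduler's load balancer never runs or migrates the task on a core outside that mask.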
Longer Answer
You need to be reasonably careful with core affinity (i.e. designating a process/thread to a certain core). Modern CPUs and OSes do all sorts of things to make fiddling with core affinity unnecessary in the general case, and those mechanisms can start working against you if you fiddle and get it wrong.
Example
On a dual-chip i7 platform it can get quite complicated indeed. On such a platform hyperthreading means that the BIOS reports 16 cores, only 8 of which are physical. Binding two threads to a core and its hyperthreaded alter ego could easily result in two slow threads.
Also, memory is normally interleaved between the two chips, one 4k page at a time (another BIOS setting). So binding a thread to a particular core can place it further away from the data it's operating on; this can overload the QPI link between the two chips and slow everything down. BTW you can allocate memory local to a chip too, take a look at this. It's a complicated topic, but you may have to embrace it too.
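As a hedged sketch of what chip-local allocation looks like (assuming libnuma is installed; link with -lnuma; the 64MB size is arbitrary):

```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    size_t size = 64 * 1024 * 1024;          /* 64MB, an arbitrary size      */
    void *buf = numa_alloc_onnode(size, 0);  /* pages backed by node 0's RAM */
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... work on buf from threads running on node 0's cores ... */

    numa_free(buf, size);
    return 0;
}
```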
Generally speaking, the optimum deployment of threads and their memory across a machine's cores, chips and DIMMs is specific to each PC. For example, with two i7s in a machine, the optimum deployment depends on how many memory DIMMs have been plugged in. These are things the operating system is well aware of, and it will generally do a pretty good job of moving threads around for the best performance.
You have to have a very particular set of circumstances before doing the distribution yourself is better. And unless you have a very fixed hardware config, you have to write your application so that it determines the best deployment for itself each time it is run. That's a lot of programming effort.
Summary
In short, it's normally best to leave well alone.
What Intel have done
Let's step back a bit and look at the philosophy behind Intel's current designs where two or more chips are present.
Intel decided that, in general, computers do many different tasks all at once on different data sets, with only modest sharing of data between threads and processes. That allowed them to synthesise an SMP architecture by using QPI to bind their CPUs together into a common memory map (otherwise it would have been strictly NUMA, not SMP). In the general case this gives excellent performance. Of course, AMD had come to the same conclusion years beforehand and used HyperTransport to implement it.
Importantly, it also keeps things simple as far as applications and operating systems are concerned, because every core in the whole machine can see the whole of memory, even if only indirectly via QPI.
Exceptions to the Rule
However, if the nature of an application is a massive dataset being processed by a thread on each core, then that remoteness of memory over QPI may be a problem. The architecture has to maintain cache coherency across all CPUs, so the QPI link can end up being thrashed with memory accesses and cache-coherency traffic. For example, on the platform I'm using QPI is only 19GB/s, whereas each CPU has 25GB/s to its three memory banks. That may have changed on more recent chips from Intel.
In such circumstances it may be better to treat the two chips as if they were a NUMA architecture. This can be done by NUMA-allocating two copies of the data set so that each CPU has its own copy, and having the threads process only their local memory. This relieves the burden on the QPI link.
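To make that concrete, here's a hedged sketch of the pattern (my own names throughout; link with -lnuma -pthread): one copy of the data set per NUMA node, and each worker thread pinned to a core on the node that owns the copy it reads, so all accesses stay off the QPI link.

```c
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#define DATA_SIZE (16 * 1024 * 1024)   /* arbitrary dataset size */

struct worker_arg {
    const char *copy;   /* this node's private copy of the data set */
    int cpu;            /* a core belonging to the same node         */
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;

    /* Pin this thread to a core on the node that owns its copy. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(arg->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* ... process arg->copy here, never the other node's copy ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int nodes = numa_max_node() + 1;
    struct worker_arg args[nodes];
    pthread_t threads[nodes];

    for (int n = 0; n < nodes; n++) {
        char *copy = numa_alloc_onnode(DATA_SIZE, n);  /* node-local copy */
        if (copy == NULL)
            return 1;
        memset(copy, 0, DATA_SIZE);   /* stand-in for copying the real data */

        /* Picking a core per node is platform-specific; a real program
           would query numa_node_to_cpus() or /sys. Core n here is just
           a placeholder. */
        args[n] = (struct worker_arg){ .copy = copy, .cpu = n };
        pthread_create(&threads[n], NULL, worker, &args[n]);
    }

    for (int n = 0; n < nodes; n++)
        pthread_join(threads[n], NULL);
    return 0;
}
```

The trade-off is memory: you pay for a full copy per node in exchange for keeping every read local.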
Working Round the Chip's Behaviour
If one is into optimisation to this extent then one rapidly begins to dislike the generalisations built into modern CPU architectures. For example, caches make assumptions about what data to load, when to load it, and when to update RAM and other caches. Generally that's fine, but sometimes one knows better.
To me the best CPU yet is the Cell processor as used in the PlayStation 3. In its eight maths cores it has no cache, so no cache coherency, no nothing. The programmer has sole responsibility for getting the DMA engines (something I wish Intel would include) to move data to the right place at the right time to be processed by the right code. Or one can leave the data where it is and DMA the code to the data. It's highly complex and takes a lot of brain power, but get it right and you can get tremendous maths performance (200 GFLOPS in 2005; miles ahead of Intel).
As to which philosophy is right? Well, Intel are bashing out Core this and Xeon that, whilst Cell is moribund/dead. It turns out there aren't many programmers out there capable of extracting peak performance by controlling everything themselves.
Upvotes: 1