Reputation: 2207
The compiler can reorder independent instructions as an optimization. Can it also silently arrange for them to be executed on different cores?
For example:
...
for (...)
{
    //...
    int a = a1 + a2;
    int b = b1 + b2;
    int c = c1 + c2;
    int d = d1 + d2;
    //...
}
...
Can it happen that, as an optimization, not only the order of execution changes, but also the number of cores used? Does the standard impose any restrictions on the compiler here?
UPD: I'm not asking how to parallelize the code. I'm asking whether code that was not explicitly parallelized can still be parallelized by the compiler.
Upvotes: 1
Views: 96
Reputation: 141618
Yes, the compiler can do things in any order (including not doing them at all), so long as the program's observable behaviour matches the observable behaviour the code is specified to have; this is the standard's "as-if" rule. Assembly instructions, runtime, thread count, etc. are not observable behaviour.
I should add that it's unlikely a compiler would decide to do this without explicit instruction from the programmer; even though the standard allows it, the compiler exists to help the programmer, and spawning extra threads unprompted would be unexpected in many cases.
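To illustrate the as-if rule, here is a minimal sketch (the function sum4 and its inputs are invented for the example):

#include <cstdio>

int sum4(int a1, int a2, int b1, int b2)
{
    int a = a1 + a2; // the optimizer may reorder these additions,
    int b = b1 + b2; // combine them into wider instructions, or
    return a + b;    // fold them away entirely: only the returned
                     // value contributes to observable behaviour
}

int main()
{
    // With constant inputs, a typical optimizer compiles this down to
    // the equivalent of printf("%d\n", 10): no additions execute at
    // runtime, let alone on another core, yet the observable behaviour
    // (the program's output) is exactly what the source specifies.
    std::printf("%d\n", sum4(1, 2, 3, 4));
}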
Upvotes: 0
Reputation: 75727
There is more here than meets the eye. Most likely the instructions in your example will end up being run in parallel, but not in the way you think.
There are many levels of hardware parallelism in a CPU, multiple cores being just the highest one 1). Inside a core there are further levels of hardware parallelism that are mostly transparent 2): you don't control them from software and you don't directly see them, only perhaps their side effects. Pipelines, extra bus lanes, and multiple ALUs (Arithmetic Logic Units) and FPUs (Floating Point Units) per core are some of them.
Different stages of your instructions will be executed in parallel in the pipeline (modern x86 processors have over a dozen pipeline stages), and different instructions may execute in parallel on different ALUs (modern x86 CPUs have around 5 ALUs per core).
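To make that concrete, here is a hedged sketch (hypothetical functions; the variable names mirror the question):

// No result feeds into another addition, so a superscalar core can
// dispatch these to different ALUs within the same cycle (assuming
// the inputs are already in registers).
int independent(int a1, int a2, int b1, int b2,
                int c1, int c2, int d1, int d2)
{
    int a = a1 + a2;
    int b = b1 + b2;
    int c = c1 + c2;
    int d = d1 + d2;
    return a ^ b ^ c ^ d; // consume the results so they are not discarded
}

// By contrast, a dependency chain: each addition needs the previous
// result, so these cannot overlap no matter how many ALUs the core has.
int chained(int a1, int a2, int b1, int b2)
{
    int s = a1 + a2;
    s += b1;
    s += b2;
    return s;
}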
All this happens without the compiler doing anything 2). And it's free (given the hardware; it was not free to add these capabilities to the hardware). Executing the instructions on different cores, by contrast, is not free: creating threads is costly, moving data so it is available to other cores is costly, and synchronizing to wait for results from other cores is costly. There is a lot of overhead associated with creating and synchronizing threads, and it is simply not worth it for small instructions like these.

The cases that would genuinely benefit from multi-threading would require analysis that is far too complicated today, so it is practically not feasible. Someday in the future we may have compilers able to identify that your serial algorithm is actually, say, a sort, and parallelize it efficiently and correctly. Until then we have to rely on language support, library support and/or developer support for parallelizing algorithms (an example follows the footnotes below).
1) Well, actually hyper-threading is.
2) As pointed out by MSalters:
modern compilers are very much aware of the various ALU's and will do work to benefit from them. In particular, register assignments are optimized so you don't have ALU's compete for the same register, something which may not be apparent from the abstract sequential model.
All this indirectly influences the execution to suit the hardware architecture; there are no explicit instructions or declarations.
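For example, with C++17's parallel algorithms the programmer opts in to multi-core execution explicitly, and only then may the library spread the work across cores (a sketch; with GCC's libstdc++ this typically also needs linking against TBB):

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> v(10'000'000);
    // ... fill v with data ...

    // Explicit opt-in: the execution policy tells the library it may
    // parallelize. The plain std::sort(v.begin(), v.end()) overload
    // would not be spread across cores behind your back.
    std::sort(std::execution::par, v.begin(), v.end());
}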
Upvotes: 3