Reputation: 158
I was wondering if there is an optimization in gcc that can make some single-threaded code, like the example below, execute in parallel. If not, why? If yes, what kinds of optimizations are possible?
#include <iostream>

int main(int argc, char *argv[])
{
    int array[10];
    for (int i = 0; i < 10; ++i) {
        array[i] = 0;
    }
    for (int i = 0; i < 10; ++i) {
        array[i] += 2;
    }
    return 0;
}
Added:
Thanks for the OpenMP links, and as useful as I think it is, my question is about compiling the same code without the need to rewrite something. So basically I want to know if:
Upvotes: 6
Views: 11835
Reputation: 364428
Yes, gcc with -ftree-parallelize-loops=4 will attempt to auto-parallelize with 4 threads, for example.
I don't know how well gcc does at auto-parallelization, but it is something that compiler developers have been working on for years. As other answers point out, giving the compiler some guidance with OpenMP pragmas can give better results. (e.g. by letting the compiler know that it doesn't matter what order something happens in, even when that may slightly change the result, which is common for floating point. Floating point math is not associative.)
Also, parallelizing only the loops marked with #pragma omp means only the really important loops get this treatment. -ftree-parallelize-loops, by contrast, has to decide for itself, so it probably benefits from PGO (profile-guided optimization) to know which loops are actually hot and worth parallelizing and/or vectorizing.
It's somewhat related to finding the kind of parallelism that SIMD can take advantage of when auto-vectorizing loops (which is enabled by default at -O3 in gcc, and at -O2 in clang).
Upvotes: 11
Reputation: 505
If you want to parallelize your C++ code, you can use OpenMP. The official documentation can be found here: openmp doc
OpenMP provides pragmas that let you tell the compiler that a portion of code should be run on a certain number of threads. You can specify the thread count manually, and other pragmas let the runtime choose the number of cores automatically.
The code below is an example from the official documentation:
#include <cmath>

int main() {
    const int size = 256;
    double sinTable[size];

    // note: M_PI is a POSIX extension, not standard C++
    #pragma omp parallel for
    for (int n = 0; n < size; ++n) {
        sinTable[n] = std::sin(2 * M_PI * n / size);
    }
}
This code will automatically parallelize the for loop, which answers your question. OpenMP offers a lot of other possibilities; you can read the documentation to learn more.
If you need to understand how to compile with OpenMP support, see this Stack Overflow thread: openmp compilation thread.
Be careful: if you don't pass the OpenMP-specific compiler option (-fopenmp for gcc), the pragmas are simply ignored and your code runs on a single thread.
I hope this helps.
Upvotes: 4
Reputation: 153840
Compilers are allowed to do whatever they want as long as the observable behavior (see 1.9 [intro.execution] paragraph 8) is identical to that specified by the [correct(*)] program. Observable behavior is specified in terms of I/O operations (using standard C++ library I/O) and accesses to volatile objects (although the compiler isn't actually required to treat volatile objects specially if it can prove that they aren't in observable memory). To this end the C++ execution system may employ parallel techniques.
Your example program actually has no observable outcome, and compilers are good at constant-folding programs to find out that a program actually does nothing. At best, the heat radiated from the CPU could be an indication of work, but the amount of energy consumed isn't one of the observable effects, i.e., the C++ execution system isn't required to produce it. If you compile the code above with clang with optimization turned on (-O2 or higher) it will actually entirely remove the loops (use the -S option to have the compiler emit assembly code so you can inspect the result easily).
Assuming you actually have loops which are forced to be executed, most contemporary compilers (at least gcc, clang, and icc) will try to vectorize the code, taking advantage of SIMD instructions. To do so, the compiler needs to comprehend the operations in the code to prove that parallel execution doesn't change the results or introduce data races (as far as I can tell, the exact results are actually not necessarily retained when floating point operations are involved, as some compilers happily parallelize, e.g., loops adding floats, although floating point addition isn't associative).
I'm not aware of a contemporary compiler which will utilize different threads of execution to improve the speed of execution without some form of hint like OpenMP's pragmas. However, discussions at the committee meetings imply that compiler vendors are at least considering doing so.
(*) The C++ standard imposes no restriction on the C++ execution system in case the program execution results in undefined behavior. Correct programs wouldn't invoke any form of undefined behavior.
tl;dr: compilers are allowed but not required to execute code in parallel and most contemporary compilers do so in some situations.
Upvotes: 7
Reputation: 855
The compiler can try to automatically parallelise your code, but it won't do so by creating threads. It may use vectorised instructions (Intel intrinsics for an Intel CPU, for example) to operate on multiple elements at a time, where it can detect that using those instructions is possible (for example, when you perform the same operation multiple times on consecutive elements of a correctly aligned data structure). You can help the compiler by telling it which instruction sets your CPU supports (-mavx, -msse4.2, ... for example).
You can also use these instructions directly, but that requires a non-trivial amount of work from the programmer. There are also libraries that do this already (see the vector class library on Agner Fog's blog).
You can get the compiler to auto-parallelise using multiple threads by using OpenMP (OpenMP introduction), though that is really you instructing the compiler to parallelise, rather than the compiler auto-parallelising by itself.
Upvotes: 12