Reputation: 53
I recently started looking into OpenMP, since I will be working on a highly computationally expensive image analysis project. I use Windows 7 with an Intel i7 (8 cores) and mingw64 gcc 4.8.1. I code in Code::Blocks and have set everything up so that it compiles and runs. In several parts of my code I do pixel-wise operations, which I thought would be good candidates for parallel processing. To my surprise, it turns out that sequential is faster than parallel processing. I tried different versions of gcc (4.7 - 4.8), both 32-bit and 64-bit, on two separate computers, but I always get the same performance issue. I then ran it with the old Visual Studio 2008 that I had on one of these two computers, and there I get the performance increase I expected. Therefore, my question is: why am I not able to see the same effect using gcc? Is there anything I am doing wrong?
Here is a minimum working example.
#include <omp.h>
#include <cstdlib>
#include <iostream>
int main(int argc, char * argv[])
{
    /* process a stack of images - set the number to 1000 for testing */
    int imgStack = 1000;

    double start_t = omp_get_wtime();

    for (int img = 0; img < imgStack; img++)
    {
        omp_set_num_threads(8);
        #pragma omp parallel for default(none)
        for (int y = 0; y < 1000000000; y++) /* increased the number of pixels to make it worthwhile and to see a difference */
        {
            for (int x = 0; x < 1000000000; x++)
            {
                unsigned char pixel[4];
                pixel[0] = 1;
                pixel[1] = 2;
                pixel[2] = 3;
                pixel[3] = 4;
                /* here I would do much more but removed it for testing purposes */
            }
        }
    }

    double end_t = (omp_get_wtime() - start_t) * 1000.0;
    std::cout << end_t << "ms" << std::endl;
    return 0;
}
In the build log I have the following:
x86_64-w64-mingw32-g++.exe -Wall -O2 -fopenmp -c C:\Code\omptest\main.cpp -o obj\Release\main.o
x86_64-w64-mingw32-g++.exe -o bin\Release\omptest.exe obj\Release\main.o -s C:\mingw-builds\x64-4.8.1-posix-seh-rev5\mingw64\bin\libgomp-1.dll
The output is the following:
for 1 thread : 43ms
for 8 threads: 594ms
I also tried turning off optimisation (-O0) in case the compiler was doing some loop unrolling. I read about the false-sharing issue, so I kept every variable inside the loop private to make sure that this is not the problem. I'm not good at profiling, so I can't tell what is going on underneath, such as internal locks that cause all the threads to wait.
I can't figure out what I'm doing wrong here.
- Edit -
Thanks to everyone. In my real code I have an image stack of 2000 images, each 2000x2000 pixels in size. I tried to simplify the example so that everyone could easily reproduce the issue, but I simplified it too much and thereby introduced other issues. You were all completely right. In my real code I use Qt for opening and displaying my images, as well as my own image manager that loads the stack and iterates through it to give me one image at a time. I thought providing the whole sample would just be too much and complicate things (i.e. it would not be a minimum working example).
I pass all the variables (imageHeight, imageWidth, etc.) as const and only the pointer to my image as shared. Initially that was a pointer to a QImage. In the loop I set the final pixel value using qtimg->setPixel(...), and it seems that the MSVC compiler deals with that differently than the gcc compiler does. Finally, I replaced the QImage pointer with a pointer to an unsigned char array, which gave me the performance increase I expected.
@Hristo Iliev: Thanks for the information about the thread pool. That's really good to know.
Upvotes: 1
Views: 1471
Reputation: 74395
Due to pixel being assigned to and never read afterwards, the whole inner loop gets completely removed by GCC's optimiser at -O2, as one can easily verify by enabling the tree dumps:
; Function <built-in> (main._omp_fn.0, funcdef_no=1036, decl_uid=21657, cgraph_uid=256)
<built-in> (void * .omp_data_i)
{
<bb 2>:
return;
}
In other words, what you are effectively measuring is the OpenMP runtime overhead.
With -O0 all the code is kept in place and the run time scales with the number of threads as expected, but I doubt that you have ever tested it with a 1000000000 x 1000000000 image.
Upvotes: 1
Reputation: 9781
Given the code example, I can't reproduce your result. You have to show your real stack size and image size, because if the work can be done in only 5 ms with one thread, multithreading won't make it quicker. Launching multiple threads introduces a large overhead, especially when you launch them imgStack times.
Upvotes: 1