sid

Reputation: 53

C++ OpenMP and gcc 4.8.1 - performance issue when parallelising loops

I recently started looking into OpenMP, since I will be working on a computationally expensive image analysis project. I use Windows 7 with an Intel i7 (8 cores) and mingw64 gcc 4.8.1. I code in Code::Blocks and have set everything up so that it compiles and runs. In several parts of my code I do pixel-wise operations, which I thought would be good candidates for parallel processing. To my surprise, it turns out that sequential processing is faster than parallel. I tried different versions of gcc (4.7 - 4.8), both 32-bit and 64-bit, on two separate computers, but I always get the same performance issue. I then ran it with the old Visual Studio 2008 I had on one of those computers, and there I get the expected performance increase. My question is therefore: why am I not able to see the same effect with gcc? Is there anything I'm doing wrong?

Here is a minimum working example.

#include <omp.h>
#include <cstdlib>
#include <iostream>

int main(int argc, char * argv[])
{
   /* process a stack of images - set the number to 1000 for testing */
   int imgStack = 1000;

   double start_t = omp_get_wtime();
   for (int img = 0; img < imgStack; img++)
   {
      omp_set_num_threads(8);
      #pragma omp parallel for default(none)
      for (int y = 0; y < 1000000000; y++) /* increased the number of pixels to make it worthwhile and to see a difference*/
      {
         for (int x = 0; x < 1000000000; x++)
         {
            unsigned char pixel[4];
            pixel[0] = 1;
            pixel[1] = 2;
            pixel[2] = 3;
            pixel[3] = 4;

            /* here I would do much more but removed it for testing purposes */

         }
      }
   }
   double end_t = (omp_get_wtime() - start_t) * 1000.0;
   std::cout << end_t << "ms" << std::endl;

   return 0;
}

In the build log I have the following:

x86_64-w64-mingw32-g++.exe -Wall -O2 -fopenmp -c C:\Code\omptest\main.cpp -o obj\Release\main.o
x86_64-w64-mingw32-g++.exe -o bin\Release\omptest.exe obj\Release\main.o -s C:\mingw-builds\x64-4.8.1-posix-seh-rev5\mingw64\bin\libgomp-1.dll

The output is the following:

for 1 thread :   43ms
for 8 threads:  594ms

I also tried turning off optimisation (-O0) in case the compiler was doing some loop unrolling. I read about the false-sharing issue, so I kept every variable inside the loop private to make sure that was not the problem. I'm not good at profiling, so I can't tell what is going on underneath, such as internal locks that cause all the threads to wait.

I can't figure out what I'm doing wrong here.

- Edit -

Thanks to everyone. In my real code I have an image stack of 2000 images, each 2000x2000 pixels. I tried to simplify the example so that everyone could easily reproduce the issue, but I simplified it too much, which caused other problems. You were all completely right. In my real code I use Qt for opening and displaying my images, together with my own image manager that loads the stack and iterates through it to give me one image at a time. I thought providing the whole sample would be too much and would complicate things (i.e. it would not be a minimum working example).

I pass all the variables (imageHeight, imageWidth, etc.) as const, and only the pointer to my image as shared. Initially that was a pointer to a QImage. In the loop I set the final pixel value using qtimg->setPixel(...), and it seems the MSVC compiler deals with that differently than gcc does. When I finally replaced the QImage pointer with a pointer to an unsigned char array, I got the expected performance increase.

@Hristo Iliev: Thanks for the information about the thread pool. That's really good to know.

Upvotes: 1

Views: 1471

Answers (2)

Hristo Iliev

Reputation: 74395

Because pixel is only assigned to and never read, GCC's optimiser removes the whole inner loop at -O2, as one can easily verify by enabling the tree dumps:

; Function <built-in> (main._omp_fn.0, funcdef_no=1036, decl_uid=21657, cgraph_uid=256)

<built-in> (void * .omp_data_i)
{
<bb 2>:
  return;

}

So what you are effectively measuring is the OpenMP runtime overhead.

With -O0 all the code is kept in place, and the run time scales as expected with the number of threads, but I doubt you have ever tested it with a 1000000000 x 1000000000 image.

Upvotes: 1

kangshiyin

Reputation: 9781

Given the code example, I can't reproduce your result. You would have to show your real stack size and image size, because if the work can be done in only 5 ms with one thread, multithreading won't make it quicker. Launching multiple threads introduces a large overhead, especially when you launch them imgStack times.

Upvotes: 1
