Reputation: 12084

OpenMP and C++ parallel for loop: why does my code slow down when using OpenMP?

I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.

#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>

using namespace std;

int main(){
  srand(time(NULL));//Seed random number generator                                                                               

  vector<int>v;//Create vector to hold random numbers in interval [0,9]                                                                                   
  vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0                                                                    

  for(int i=0;i<1e9;++i)
    v.push_back(rand()%10);//Push back random numbers [0,9]                                                                      

  clock_t c=clock();

  #pragma omp parallel for
  for(int i=0;i<v.size();++i)
    d[v[i]]+=1;//Count number stored at v[i]                                                                                     

  cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;

  for(vector<int>::iterator i=d.begin();i!=d.end();++i)
  cout<<*i<<endl;

  return 0;
}

The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there is (i.e., how many ones are found in v, how many twos, etc.)

Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?

My problem is when I try to make the counting loop parallel. Without the #pragma OpenMP statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.

Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?

Upvotes: 4

Answers (2)

Pyrce

Reputation: 8571

EDIT Once your race condition is resolved via correct synchronization, then the following paragraph applies, before that your data race conditions unfortunately make speed comparisons mute:

Your program is slowing down because you have 10 possible outputs during the pragma section which are being accessed randomly. OpenMP cannot access any of those elements without a lock (which you would need to provide via synchronization) as a result and locking will cause your threads to have a higher overhead than you gain from counting in parallel.

A solution to make this speed up, is to instead make a local variable for each OpenMP thread which counts all of the 0-10 values that a particular thread has seen. Then sum those up in the master count vector. This will be easily parallelized and much faster as the threads don't need to lock on a shared write vector. I would expect a close to Nx speed up where N is the number of threads from OpenMP as there should be very limited locking required. This solution also avoids a lot of the race conditions currently in your code.

See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread local OpenMP

Upvotes: 1

Hristo Iliev

Reputation: 74485

Your code exibits:

race conditions due to unsyncronised access to a shared variable
false and true sharing cache problems
wrong measurement of run time

Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.

False sharing occurs when two threads write to memory locations that are close to one another as to result on the same cache line. This results in the cache line constantly bouncing from core to core or CPU to CPU in multisocket systems and excess of cache coherency messages. With 32 bytes per cache line 8 elements of the vector could fit in one cache line. With 64 bytes per cache line the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) ones. True sharing occurs at those elements that are accesses by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to elements of d (faster or slower, depending on your OpenMP runtime) or accumulate local values and then do a final synchronised reduction (fastest). The first one is implemented like this:

#pragma omp parallel for
for(int i=0;i<v.size();++i)
  #pragma omp atomic
  d[v[i]]+=1;//Count number stored at v[i]

The second is implemented like this:

omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
  omp_init_lock(&locks[i]);

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
  int vv = v[i];
  omp_set_lock(&locks[vv]);
  d[vv]+=1;//Count number stored at v[i]
  omp_unset_lock(&locks[vv]);
}

for (int i = 0; i < 10; i++)
  omp_destroy_lock(&locks[i]);

(include omp.h to get access to the omp_* functions)

I leave it up to you to come up with an implementation of the third option.

You are measuring elapsed time using clock() but it measures the CPU time, not the runtime. If you have one thread running at 100% CPU usage for 1 second, then clock() would indicata an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() would indicate an increate in CPU time of 8 seconds (that is 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high resolution timer API) instead.

Upvotes: 6

OpenMP and C++ parallel for loop: why does my code slow down when using OpenMP?

Answers (2)

Related Questions