Reputation: 193
I am going through OpenMP tutorials and, as I progressed, I wrote an OpenMP version of a code that calculates PI using an integral.
I wrote a serial version first, so I know the serial counterpart is ok. Once the OpenMP version was complete, I noticed that every time I run it, it gives me a different answer. Over several runs the output is broadly around the right number, but still, I didn't expect several OpenMP runs to give different answers.
#include<stdio.h>
#include<stdlib.h>
#include<omp.h>

void main()
{
    int nb=200,i,blob;
    float summ=0,dx,argg;

    dx=1./nb;
    printf("\n dx------------: %f \n",dx);
    omp_set_num_threads(nb);

    #pragma omp parallel
    {
        blob=omp_get_num_threads();
        printf("\n we have now %d number of threads...\n",blob);
        int ID=omp_get_thread_num();
        i=ID;
        printf("\n i is now: %d \n",i);
        argg=(4./(1.+i*dx*i*dx))*dx;
        summ=summ+argg;
        printf("\t\t and summ is %f \n",summ);
    }
    printf("\ntotal summ after loop: %f\n",summ);
}
I compile this code on Red Hat using gcc -f mycode.c -fopenmp, and when I run it, say 3 times, I get:
3.117
3.113
3.051
Could anyone help me understand why I get different results? Am I doing something wrong? The parallelism just splits the integration interval, but as the rectangles are calculated, shouldn't the total be the same when they get summed up at the end?
The serial version gives me 3.13
(the fact I don't get 3.14 is normal because I used a very coarse sampling of the integral with just 200 divisions between 0 and 1)
I have also tried adding a barrier, but I still get different answers; they are closer to the serial version, but still spread in value and not identical...
Upvotes: 1
Views: 277
Reputation: 11527
The problem comes from the variables summ, argg and i. They belong to the global sequential scope and cannot be modified without precautions. You have races between the threads, and this may result in unexpected values in these variables. Races are completely nondeterministic, which explains the different results that you get. You may as well get the correct result or any incorrect result, depending on the temporal ordering of the reads and writes to these variables.
The proper way to deal with this problem:

- For the variables argg and i: they are declared in the global scope, but they are only used for temporary computations inside the threads. You should either declare them inside the parallel region to make them thread-private, or add private(argg,i) to the omp directive. Note that there is also a potential problem with blob, but its value is identical in all threads, so it should not change the behavior of the program.

- For the variable summ the situation is different. It really is a global variable that accumulates values from the threads. It must remain global, but you must add the atomic OpenMP directive when modifying it, so that the complete read-modify-write operation on the variable becomes unbreakable; this ensures a race-free modification.
Here is a modified version of your code that gives coherent results (but floating-point addition is not associative, so the last decimal may still change).
#include<stdio.h>
#include<stdlib.h>
#include<omp.h>

void main()
{
    int nb=200,i,blob;
    float summ=0,dx,argg;

    dx=1./nb;
    printf("\n dx------------: %f \n",dx);
    omp_set_num_threads(nb);

    #pragma omp parallel private(argg,i)
    {
        blob=omp_get_num_threads();
        printf("\n we have now %d number of threads...\n",blob);
        int ID=omp_get_thread_num();
        i=ID;
        printf("\n i is now: %d \n",i);
        argg=(4./(1.+i*dx*i*dx))*dx;
        #pragma omp atomic
        summ=summ+argg;
        printf("\t\t and summ is %f \n",summ);
    }
    printf("\ntotal summ after loop: %f\n",summ);
}
As already noted, this is not the best way to use threads. Creating and synchronizing threads is expensive, and it is rarely useful to have more threads than the number of cores.
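For illustration, a minimal sketch of this more conventional approach could look like the following: one parallel for loop over the rectangles with a reduction clause, letting OpenMP pick the default number of threads (the 200-rectangle setup is kept from the question):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nb = 200;                    /* number of rectangles, as in the question */
    float summ = 0.f, dx = 1.f / nb;

    /* Each thread gets a private partial sum; the reduction clause
       combines the partial sums safely at the end of the loop. */
    #pragma omp parallel for reduction(+:summ)
    for (int i = 0; i < nb; i++)
        summ += (4.f / (1.f + i * dx * i * dx)) * dx;

    printf("total summ: %f\n", summ);
    return 0;
}

Compiled with gcc -fopenmp as before, the result no longer depends on which thread finishes first, only on floating-point rounding.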
Upvotes: 1
Reputation: 501
I believe the problem lies in declaring int i and float argg outside the parallel loop.
What's happening is that all your 200 threads are overwriting i and argg, so sometimes the argg of one thread is overwritten by the argg from another thread, resulting in the unpredictable error that you observe.
Here is a working code that always prints the same value (up to 6 decimals or so):
#include <stdio.h>
#include <omp.h>

void main()
{
    int nb = 200, blob;
    float summ = 0, dx; // , argg;

    dx = 1. / nb;
    printf("\n dx------------: %f \n", dx);
    omp_set_num_threads(nb);

    #pragma omp parallel
    {
        blob = omp_get_num_threads();
        printf("\n we have now %d number of threads...\n", blob);
        int i = omp_get_thread_num();
        printf("\n i is now: %d \n", i);
        float argg = (4. / (1. + i * dx * i * dx)) * dx;
        summ = summ + argg;
        printf("\t\t and summ is %f \n", summ);
    }
    printf("\ntotal summ after loop: %f\n", summ);
}
However, changing the last printf to use %.9f reveals that it is not, in fact, the exact same float number every run. This is due to numerical error in floating-point addition: a+b+c does not guarantee the same result as a+c+b. You can try this in the example below:
First add float* arr = new float[nb]; before the parallel loop AND arr[i] = argg; within the parallel loop, after argg is defined, of course (note that new and std::sort require the file to be compiled as C++, with #include <algorithm>). Then add the following after the parallel loop:
float testSum = 0;
for (int i = 0; i < nb; i++)
    testSum += arr[i];
printf("random sum: %.9f\n", testSum);

std::sort(arr, arr + nb);
testSum = 0;
for (int i = 0; i < nb; i++)
    testSum += arr[i];
printf("sorted sum: %.9f\n", testSum);

testSum = 0;
for (int i = nb-1; i >= 0; i--)
    testSum += arr[i];
printf("reversed sum: %.9f\n", testSum);
Most likely, the sorted sum and the reversed sum are slightly different, even though they are computed by adding up the exact same 200 numbers.
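To see this non-associativity in isolation, here is a tiny standalone sketch (independent of the PI code) that adds the same three floats in two different groupings:

#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = 3.0f, c = 3.0f;

    /* Grouped one way, each small term is absorbed separately by the large one;
       grouped the other way, the small terms are added together first. */
    float left  = (a + b) + c;   /* 100000000.0 */
    float right = a + (b + c);   /* 100000008.0 */

    printf("(a+b)+c = %.1f\n", left);
    printf("a+(b+c) = %.1f\n", right);
    return 0;
}

The two results differ even though the same three numbers are added, which is exactly what the sorted/reversed sums above demonstrate on a larger scale.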
Another thing you might want to note is that you're very unlikely to find a processor that can actually run 200 threads in parallel. Most common processors can handle 4 to 32 threads, while specialised server processors can go up to 112 threads with the $15k Xeon Platinum 9282.
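If you are curious how much hardware parallelism OpenMP actually sees on your machine, a quick standalone check (just a sketch) is:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Logical processors visible to the OpenMP runtime, and the
       number of threads it would use by default. */
    printf("processors available: %d\n", omp_get_num_procs());
    printf("default max threads:  %d\n", omp_get_max_threads());
    return 0;
}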
As such, we usually do the following:

- We remove omp_set_num_threads(nb); to use the recommended number of threads.
- We remove int i = omp_get_thread_num(); and use the int i from the for loop instead.
- We rewrite the parallel region as a for loop:

#pragma omp parallel for
for (int i = 0; i < nb; i++)
{...}
The result should be identical, but you're now only using as many threads as are available on the actual hardware. This reduces context switching between threads and should improve the time performance of your code.
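Putting the three changes together, the program might look roughly like the sketch below. The reduction(+:summ) clause is an addition of this sketch, not one of the steps above: it gives every thread a private partial sum and combines them at the end of the loop, so the shared summ is never written concurrently.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nb = 200;
    float summ = 0, dx = 1.f / nb;

    printf("\n dx------------: %f \n", dx);

    /* The loop index i is private by default; reduction(+:summ)
       keeps the accumulation race-free. */
    #pragma omp parallel for reduction(+:summ)
    for (int i = 0; i < nb; i++)
    {
        float argg = (4. / (1. + i * dx * i * dx)) * dx;
        summ = summ + argg;
    }

    printf("\ntotal summ after loop: %f\n", summ);
    return 0;
}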
Upvotes: 2