Reputation: 55
I want to use OpenMP to parallelize a for loop that does something like:
B = (int*)malloc(sizeof(int) * N); //N is known
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++) B[j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
I don't need to re-allocate B, since its values are redefined on each iteration and the operations only use B[0] to B[M-1]. This saves a lot of time on allocating and initializing B.
In order to use OpenMP, I changed the code to this:
#pragma omp parallel num_threads(32) private(i,j,B,M,L)
{
B = (int*)malloc(sizeof(int) * N); //N is known
#pragma omp parallel for
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++) B[j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
}
It runs really slowly compared to the first version, as it creates a new B array for each thread (so 500000 times). Is there a way to avoid this using OpenMP?
Upvotes: 1
Views: 206
Reputation: 51533
The main issue is that the iterations of the loop are not being assigned to threads as you wanted. Because you have added the parallel clause again to #pragma omp for, and assuming that nested parallelism is disabled (which it is by default), each of the threads created in the outer parallel region will execute "sequentially" the code within that region, namely:
#pragma omp parallel for
for(i=0;i<500000;i++){
...
}
Therefore, each thread will execute all the 500000 iterations of the loop that you intended to parallelize. Consequently, this removes the parallelism and adds extra overhead (e.g., thread creation) on top of the sequential code. Nonetheless, the issue is easily solved by merely removing the second parallel clause, namely:
#pragma omp parallel num_threads(32) private(i,j,B,L) firstprivate(M) //firstprivate keeps each thread's copy of M initialized
{
B = (int*)malloc(sizeof(int) * N); //N is known
#pragma omp for
for(i=0;i<500000;i++){
...
}
}
Depending upon the setup where the code will be executed (e.g., whether or not it is a NUMA architecture, whether or not the malloc implementation is a thread-aware memory allocator, among others), it might be advisable to profile your parallel region to check if it pays off (or not) to move the allocation out of that region, using a 2D array with one row per thread. An example of what that alternative version might look like:
int total_threads = 32;
int** B = malloc(sizeof(int*) * total_threads);
for(int i = 0; i < total_threads; i++){
B[i] = malloc(N * sizeof(int));
}
#pragma omp parallel num_threads(32) private(i,j,L) firstprivate(M)
{
int threadID = omp_get_thread_num();
#pragma omp for
for(i=0;i<500000;i++)
{
for(j=0;j<M;j++)
B[threadID][j]=i+j; //M is different from N, but M <= N;
some operations on B which produce a variable L;
printf("%d\n",L);
}
}
// you might need to reduce all the values from all threads
// to main thread array.
Upvotes: 2