Reputation: 11
I tried to parallelize the nested loops using OpenMP, but I am not sure this is the correct way to do it. Here is the part of the code with the nested loops; it is just generic code. With noofrecords set to 50k it takes a lot of time even after parallelization. Can someone suggest a better way to parallelize the code? I am only parallelizing the outer loop in the code below.
int ctd=0;
#pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd)
for(int j=0; j<noofrecords; j++)
{
    for(int k=0; k<noofrecords; k++)
    {
        for(int l=0; l<noofrecords; l++)
        {
            if(condition)
            {
                D[ctd].a1=A[i].a1;
                ctd++;
            }
        }
    }
}
Upvotes: 1
Views: 608
Reputation: 33699
1. Make an array a1 of type D.a1 with a number of elements equal to the maximum value expected of ctd.
2. Make a private array a2 of the same type for each thread.
3. Fill a2 in parallel and use a private counter ctd2 to count the size of a2.
4. Merge each thread's a2 into a1 in order and add ctd2 to ctd.
5. Copy a1 into D in parallel.
Something like this:

int ctd=0;
double *a1 = malloc(sizeof *a1 * N); // step 1
#pragma omp parallel
{
    int ctd2 = 0;
    double *a2 = malloc(sizeof *a2 * N); // step 2
    #pragma omp for nowait
    for(int j=0; j<noofrecords; j++)
        for(int k=0; k<noofrecords; k++)
            for(int l=0; l<noofrecords; l++)
                if(condition) a2[ctd2++] = A[i].a1; // step 3
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++)
        #pragma omp ordered
        memcpy(&a1[ctd], a2, sizeof *a1 * ctd2), ctd += ctd2; // step 4
    #pragma omp for
    for(int j=0; j<ctd; j++) D[j].a1 = a1[j]; // step 5
    free(a2);
}
free(a1);
N should be the maximum size you expect ctd to have. One memory inefficiency in this method is that a2 is allocated with size N as well, which may be too large. A dynamic vector like std::vector in C++ or GArray in GLib would be more memory efficient.
Upvotes: 0
Reputation: 1145
You can use the collapse clause. In your case you have three nested for loops, so it would be something like: #pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd) collapse(3)
It will work if the for loops are perfectly nested and all the work is in the inner-most loop (which is the case in the code you posted). If noofrecords is much bigger than your maximum thread count, the speedup won't be impressive. If it's slow even in parallel, the bottleneck is probably not your processing power (more likely the RAM is already working at 100%).
Also, I'm not so sure you really want that private(j,k,l)...
Upvotes: 2