Reputation: 11
Below is my function, which I'm trying to optimize using OpenMP and loop tiling (aka loop blocking). However, out currently has the wrong value after I apply the loop tiling as shown below. Can someone look over my code and point out what makes it wrong? Thank you so much.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "utils.h"
const long BLOCK_SIZE = 8*DIM;
int i, j, k,ii,jj,kk, dim = DIM-1;
long compute, out = 1.0, we_need, gimmie;
void work_it_par(long *old, long *new)
{
    we_need = need_func();
    gimmie = gimmie_func();
    #pragma omp parallel for private(i,j,k,ii,jj,kk, compute) firstprivate(we_need, gimmie, dim,old,BLOCK_SIZE) reduction(+:out) num_threads(omp_get_num_procs())
    for (ii=1; ii<dim-BLOCK_SIZE; ii+=BLOCK_SIZE) {
        for (jj=1; jj<dim-BLOCK_SIZE; jj+=BLOCK_SIZE) {
            for (kk=1; kk<dim-BLOCK_SIZE; kk+=BLOCK_SIZE) {
                for (i=ii; i<ii+BLOCK_SIZE; i++) {
                    for (j=jj; j<jj+BLOCK_SIZE; j++) {
                        for (k=kk; k<kk+BLOCK_SIZE; k++) {
                            //int temp = i*DIM*DIM+j*DIM+k;
                            compute = old[i*DIM*DIM+j*DIM+k] * we_need;
                            out += compute / gimmie;
                        }
                    }
                }
            }
        }
    }
    printf("AGGR:%ld\n",out);
}
Upvotes: 0
Views: 1151
Reputation: 9489
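First of all, const long BLOCK_SIZE = 8*DIM; seems super fishy to me... With a block larger than the whole array, dim-BLOCK_SIZE is negative, so the outer loops never execute at all. Maybe replacing the * by a / would be more of what you wanted?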
Even so, you still have to deal with the limits by checking that the i, j and k indexes do not go over their limits. I'll let you figure out how to achieve that.
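One common way to handle those limits is to clamp each tile's upper bound to the end of the range. The sketch below is only an illustration, not the asker's intended solution: DIM, BLOCK, fill_demo, sum_plain, sum_tiled and check_tiling are hypothetical names and values, and it assumes the untiled loops were meant to cover indexes 1 to DIM-2 in each dimension.

```c
#include <assert.h>

#define DIM 10    /* hypothetical cube edge, not taken from the question */
#define BLOCK 3   /* hypothetical tile size; deliberately does not divide DIM-2 */

/* Smaller of two longs; used to clamp a tile's upper bound. */
static long min_long(long a, long b) { return a < b ? a : b; }

/* Fill the cube with small, predictable values for the comparison. */
void fill_demo(long *a)
{
    for (long n = 0; n < (long)DIM * DIM * DIM; n++)
        a[n] = n % 7;
}

/* Reference: untiled sum over the assumed interior range [1, DIM-1). */
long sum_plain(const long *a)
{
    long out = 0;
    for (long i = 1; i < DIM - 1; i++)
        for (long j = 1; j < DIM - 1; j++)
            for (long k = 1; k < DIM - 1; k++)
                out += a[i * DIM * DIM + j * DIM + k];
    return out;
}

/* Tiled sum: tile loops run to the real limit, inner loops are clamped. */
long sum_tiled(const long *a)
{
    const long hi = DIM - 1;   /* exclusive upper limit for i, j and k */
    long out = 0;
    for (long ii = 1; ii < hi; ii += BLOCK)
        for (long jj = 1; jj < hi; jj += BLOCK)
            for (long kk = 1; kk < hi; kk += BLOCK) {
                /* The last tile in each dimension may be shorter than BLOCK. */
                const long i_end = min_long(ii + BLOCK, hi);
                const long j_end = min_long(jj + BLOCK, hi);
                const long k_end = min_long(kk + BLOCK, hi);
                for (long i = ii; i < i_end; i++)
                    for (long j = jj; j < j_end; j++)
                        for (long k = kk; k < k_end; k++)
                            out += a[i * DIM * DIM + j * DIM + k];
            }
    return out;
}

/* Self-check: both traversals must visit exactly the same cells. */
void check_tiling(void)
{
    static long a[DIM * DIM * DIM];
    fill_demo(a);
    assert(sum_tiled(a) == sum_plain(a));
}
```

Calling check_tiling() from main is enough to exercise it. Note that the tile loops use ii < hi rather than ii < hi - BLOCK, so the trailing partial tile is not skipped.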
Last point on the algorithm: are you sure your loops have to start from index 1?
Finally, a few notes on the OpenMP correctness:
- firstprivate(we_need, gimmie, dim,old,BLOCK_SIZE) doesn't make much sense. These could happily stay shared.
- I'm not sure whether num_threads(omp_get_num_procs()) is correct or not. My feeling is that it is indeed valid, but just for "safety", I would tend to separate the call to the function from the directive (by either calling the function first, storing its result in a constant, and using that in the directive, or calling omp_set_num_threads() before the parallel directive).
- You might also consider the collapse clause to increase the level of parallelism you achieve here...

Good luck with your code.
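To make the OpenMP notes above concrete, here is a rough sketch of how they might fit together. It is not the asker's final code: scaled_sum_serial, scaled_sum_par and check_parallel are hypothetical names, the interior range [1, n-1) is an assumption, and the accumulator starts at 0 rather than the 1 used in the question. Read-only inputs are left shared, loop variables are declared inside the loops so they are private automatically, the thread count is fetched before the directive, and collapse(3) spreads all tiles across the threads.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

static long min_long(long a, long b) { return a < b ? a : b; }

/* Serial reference over the assumed interior range [1, n-1). */
long scaled_sum_serial(const long *old, long n, long we_need, long gimmie)
{
    long out = 0;
    for (long i = 1; i < n - 1; i++)
        for (long j = 1; j < n - 1; j++)
            for (long k = 1; k < n - 1; k++) {
                const long compute = old[i * n * n + j * n + k] * we_need;
                out += compute / gimmie;
            }
    return out;
}

/* Tiled, parallel version of the same sum. */
long scaled_sum_par(const long *old, long n, long block,
                    long we_need, long gimmie)
{
    const long hi = n - 1;   /* exclusive upper limit for i, j and k */
    long out = 0;            /* the question starts from 1; 0 is used here */
#ifdef _OPENMP
    const int nthreads = omp_get_num_procs();   /* called outside the directive */
#else
    const int nthreads = 1;                     /* built without OpenMP */
#endif
    (void)nthreads;   /* avoids an unused warning when the pragma is ignored */

    /* old, hi, block, we_need and gimmie are read-only, so they stay shared. */
    #pragma omp parallel for collapse(3) reduction(+:out) num_threads(nthreads)
    for (long ii = 1; ii < hi; ii += block)
        for (long jj = 1; jj < hi; jj += block)
            for (long kk = 1; kk < hi; kk += block) {
                const long i_end = min_long(ii + block, hi);
                const long j_end = min_long(jj + block, hi);
                const long k_end = min_long(kk + block, hi);
                for (long i = ii; i < i_end; i++)
                    for (long j = jj; j < j_end; j++)
                        for (long k = kk; k < k_end; k++) {
                            const long compute = old[i * n * n + j * n + k] * we_need;
                            out += compute / gimmie;
                        }
            }
    return out;
}

/* Self-check: the parallel tiled sum must match the serial one. */
void check_parallel(void)
{
    enum { N = 10 };   /* hypothetical size */
    static long a[N * N * N];
    for (long n = 0; n < (long)N * N * N; n++)
        a[n] = n % 7;
    assert(scaled_sum_par(a, N, 3, 3, 2) == scaled_sum_serial(a, N, 3, 2));
}
```

With GCC or Clang this would typically be built with -fopenmp; without that flag the pragma is ignored and the function simply runs serially.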
Upvotes: 1