Reputation: 31
I am attempting to speed up this for loop with OpenMP parallelization. I was under the impression that this should split up the work across a number of threads. However, perhaps the overhead is too large for this to give me any speedup.
I should mention that this loop runs a very large number of times, and each instance of the loop should be parallelized. The number of outer-loop iterations, newNx, can be as small as 3 or as large as 256. However, even if I parallelize it conditionally, only for newNx > 100 (the largest loops), it still slows down significantly.
Is there anything in here which would cause this to be slower than anticipated? I should also mention that the vectors A,v,b are VERY large, but access is O(1) I believe.
#pragma omp parallel for private(j,k),shared(A,v,b)
for(i=1;i<=newNx;i+=2) {
    for(j=1;j<=newNy;j++) {
        for(k=1;k<=newNz;k+=1) {
            nynz=newNy*newNz;
            v[(i-1)*nynz+(j-1)*newNz+k] =
                -(v[(i-1)*nynz+(j-1)*newNz+k+1 - 2*(k/newNz)]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kup+offA] +
                  v[(i-1)*nynz+(j-1)*newNz+ k-1+2*(1/k)]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kdo+offA] +
                  v[(i-1)*nynz+(j - 2*(j/newNy))*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jup+offA] +
                  v[(i-1)*nynz+(j-2 + 2*(1/j))*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jdo+offA] +
                  v[(i - 2*(i/newNx))*nynz+(j-1)*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + iup+offA] +
                  v[(i-2 + 2*(1/i))*nynz+(j-1)*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ido+offA] -
                  b[(i-1)*nynz + (j-1)*newNz + k])
                /A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ifi+offA];
        }
    }
}
Upvotes: 3
Views: 4398
Reputation: 33659
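Assuming you don't have a race condition, you can try fusing the loops. As written, only the outer loop is split across threads, and with a stride of 2 it has only about newNx/2 iterations (as few as 2 when newNx is 3). Fusing gives more and larger chunks to parallelize, which will help reduce the effect of false sharing and will likely distribute the load better as well.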
For a triple loop like this
for(int i2=0; i2<x; i2++) {
    for(int j2=0; j2<y; j2++) {
        for(int k2=0; k2<z; k2++) {
            // loop body
        }
    }
}
you can fuse it like this
#pragma omp parallel for
for(int n=0; n<(x*y*z); n++) {
    int i2 = n/(y*z);
    int j2 = (n%(y*z))/z;
    int k2 = (n%(y*z))%z;
    // loop body
}
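To convince yourself that the fused index really visits the same (i2, j2, k2) triples as the nested loops, a small check along these lines may help. This is only a sketch: `fill_nested`, `fill_fused` and the value written to `out` are made up for illustration and are not part of the original code.

```c
/* Hypothetical helper: the original triple loop, writing a value
   that encodes (i2, j2, k2) into a flat array. */
void fill_nested(int x, int y, int z, int *out) {
    for (int i2 = 0; i2 < x; i2++)
        for (int j2 = 0; j2 < y; j2++)
            for (int k2 = 0; k2 < z; k2++)
                out[(i2 * y + j2) * z + k2] = 100 * i2 + 10 * j2 + k2;
}

/* Hypothetical helper: the same work with the three loops fused.
   Each n is decomposed back into (i2, j2, k2). */
void fill_fused(int x, int y, int z, int *out) {
    #pragma omp parallel for
    for (int n = 0; n < x * y * z; n++) {
        int i2 = n / (y * z);
        int j2 = (n % (y * z)) / z;
        int k2 = (n % (y * z)) % z;
        out[n] = 100 * i2 + 10 * j2 + k2;
    }
}
```

Both functions should fill the array identically. Without OpenMP enabled the pragma is ignored and the fused loop simply runs serially.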
In your case you can do it like this
int i, j, k, n;
int x = newNx%2 ? newNx/2+1 : newNx/2;  // number of odd i in 1..newNx
int y = newNy;
int z = newNz;
#pragma omp parallel for private(i, j, k)
for(n=0; n<(x*y*z); n++) {
    i = 2*(n/(y*z)) + 1;    // i = 1, 3, 5, ...
    j = (n%(y*z))/z + 1;    // j = 1..newNy
    k = (n%(y*z))%z + 1;    // k = 1..newNz
    // rest of code
}
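If your compiler supports OpenMP 3.0 or later, the `collapse` clause is a possible alternative to fusing by hand: it asks the runtime to treat the nested loops as one iteration space while keeping i, j, k readable. The sketch below compares the two on a dummy body that just counts visits; `visit_fused`, `visit_collapse` and the `count` array are hypothetical names for illustration, and the same caveat about race conditions applies to both.

```c
/* Hand-fused mapping: n -> (i, j, k) with i = 1, 3, 5, ... */
void visit_fused(int newNx, int newNy, int newNz, int *count) {
    int x = (newNx + 1) / 2;   /* number of odd i in 1..newNx */
    int y = newNy, z = newNz;
    #pragma omp parallel for
    for (int n = 0; n < x * y * z; n++) {
        int i = 2 * (n / (y * z)) + 1;
        int j = (n % (y * z)) / z + 1;
        int k = (n % (y * z)) % z + 1;
        /* every (i, j, k) maps to its own element, so no two threads collide */
        count[((i - 1) * y + (j - 1)) * z + (k - 1)] += 1;
    }
}

/* Same iteration space, letting OpenMP do the fusing. */
void visit_collapse(int newNx, int newNy, int newNz, int *count) {
    #pragma omp parallel for collapse(3)
    for (int i = 1; i <= newNx; i += 2)
        for (int j = 1; j <= newNy; j++)
            for (int k = 1; k <= newNz; k++)
                count[((i - 1) * newNy + (j - 1)) * newNz + (k - 1)] += 1;
}
```

Both versions should touch every cell of the odd-i planes exactly once and leave the even-i planes alone. Which one is faster is something you would have to measure.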
If this successfully speeds up your code, then you can feel good that you made your code faster and at the same time obfuscated it even further.
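Two smaller things may also be worth checking. In the question's code nynz is assigned inside the innermost loop but is not listed in `private`, so if it is declared outside the parallel region every thread keeps writing the same shared variable. Computing it once before the loop avoids that. Also, the "only parallelize when newNx > 100" test can be written with OpenMP's `if` clause instead of duplicating the loop. The sketch below shows both on a deliberately simplified body: `update_odd_planes`, the 0-based `idx` and the assignment `v[idx] = -b[idx]` are stand-ins I made up, not your stencil.

```c
/* Hypothetical stand-in for the question's kernel. Only the loop
   structure and the OpenMP clauses are the point here. */
void update_odd_planes(double *v, const double *b,
                       int newNx, int newNy, int newNz) {
    const int nynz = newNy * newNz;   /* computed once, only read in the loop */

    /* runs in parallel only when the if() condition is true */
    #pragma omp parallel for if(newNx > 100)
    for (int i = 1; i <= newNx; i += 2) {
        for (int j = 1; j <= newNy; j++) {
            for (int k = 1; k <= newNz; k++) {
                int idx = (i - 1) * nynz + (j - 1) * newNz + (k - 1);
                v[idx] = -b[idx];     /* simplified body, not the real stencil */
            }
        }
    }
}
```

Declaring i, j, k and idx inside the loops makes them private automatically, so no `private` clause is needed in this form.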
Upvotes: 6