Reputation: 63
#pragma omp parallel
{
    for (i = 1; i < 1024; i++) {
        #pragma omp for
        for (j = 1; j < 1024; j++)
            A[i][j] = 2 * A[i-1][j];
    }
}
I'm using 12 threads to execute this code. Any suggestions on what I should do to speed it up?
Upvotes: 2
Views: 1259
Reputation: 40709
1) How many cores do you have? You can't get more speedup from parallelism than your core count, and as others said, you'll probably see a lot less.
2) It looks like the inner index j should start at 0, not 1: as written, column 0 of each row is never updated.
3) That inner loop is crying out for pointers and unrolling, as in
double* pa = &A[i][0];
double* pa1 = &A[i-1][0];
for (j = 0; j < 1024; j += 8) {
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
}
or...
double* pa = &A[i][0];
double* paEnd = &A[i][1024];
double* pa1 = &A[i-1][0];
for (; pa < paEnd; pa += 8, pa1 += 8) {
    pa[0] = 2 * pa1[0];
    pa[1] = 2 * pa1[1];
    pa[2] = 2 * pa1[2];
    pa[3] = 2 * pa1[3];
    pa[4] = 2 * pa1[4];
    pa[5] = 2 * pa1[5];
    pa[6] = 2 * pa1[6];
    pa[7] = 2 * pa1[7];
}
whichever is faster.
Upvotes: 0
Reputation: 19746
Assuming A's element type is smaller than 64 bytes (one cache line), parallelizing the inner loop this way will most likely cause false sharing on cache lines.
Say A is an aligned array of 4-byte ints: then A[i][0] through A[i][15] sit in the same cache line. This means that all 12 threads attempt to read the line simultaneously, each for the part it needs. If they only read, the line could simply be shared between the cores, but each thread also writes its part back, so each core must take ownership of the line in order to modify it.
CPU caches usually implement a MESI-based coherency protocol, so a store attempt issues a read-for-ownership that invalidates the line in every core except the requester. Issuing 12 such requests in parallel (or rather 6, if you have 6 cores with 2 hardware threads each) creates a race in which the first core to win the line may well have it snooped away before it even has a chance to modify it (although that's not likely). The result is quite messy, and it may take a while for the line to travel to each core in turn, get modified, and then get snooped out by the next core. This recurs for each consecutive group of 16 elements (again, assuming 4-byte ints).
What you might do is:
This, however, would prevent you from reaching the full potential of the CPU, as you lose the spatial locality and the streaming property of your code. Instead, you may:
There's still a downside here: if a thread encounters too many streams, it might lose track of them. A third approach, therefore, is -
Upvotes: 1