Reputation: 63
#pragma omp parallel
{
    for (i = 1; i < 1024; i++) {
        #pragma omp for
        for (j = 1; j < 1024; j++)
            A[i][j] = 2 * A[i-1][j];
    }
}
I'm using 12 threads to execute this code. Any suggestions on what I should do to speed it up?
Upvotes: 2
Views: 1259
Reputation: 40709
1) How many cores do you have? You can't get more speedup from parallelism than your core count, and as others said, you'll probably see a lot less.
2) It looks like the inner index j should start at 0, not 1: as written, column 0 of each row is never updated.
3) That inner loop is crying out for pointers and unrolling, as in
double* pa = &A[i][0];
double* pa1 = &A[i-1][0];
for (j = 0; j < 1024; j += 8) {
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
}
or...
double* pa = &A[i][0];
double* paEnd = &A[i][1024];
double* pa1 = &A[i-1][0];
for (; pa < paEnd; pa += 8, pa1 += 8) {
    pa[0] = 2 * pa1[0];
    pa[1] = 2 * pa1[1];
    pa[2] = 2 * pa1[2];
    pa[3] = 2 * pa1[3];
    pa[4] = 2 * pa1[4];
    pa[5] = 2 * pa1[5];
    pa[6] = 2 * pa1[6];
    pa[7] = 2 * pa1[7];
}
whichever is faster.
Upvotes: 0
Reputation: 19746
Assuming A's element type is smaller than 64 bytes (one cache line), parallelizing the inner loop this way will most likely cause false sharing on cache lines.
Say A is an aligned array of 4-byte ints: then A[i][0] through A[i][15] sit in the same cache line. This means that all 12 threads attempt to read the line simultaneously, each for the part it needs. If they only read, the line could simply be shared between the cores, but each thread also writes its part back, so each core must take ownership of the line in order to modify it.
CPU caches usually implement a MESI-based coherency protocol, so a store attempt issues a read-for-ownership that invalidates the line in every core except the requester. Issuing 12 such requests in parallel (or rather 6, if you have 6 cores with 2 hardware threads each) creates a race in which the first core to win the line may well have it snooped away before it even has a chance to modify it (although that's not likely). The result is quite messy, and it may take a while for the line to travel to each core in turn, get modified, and then get snooped out by the next core. This recurs for each consecutive group of 16 elements (again, assuming 4-byte ints).
What you might do is:
This, however, would prevent you from reaching the full potential of the CPU, as you lose the spatial locality and the streaming property of your code. Instead, you may:
There's still a downside here: if a thread encounters too many streams, it might lose track of them. A third approach, therefore, is -
Upvotes: 1