manywows

Reputation: 195

Parallelize a matrix with OpenMP to avoid false sharing

I am attempting to optimize a parallel algorithm as such:

    #include <math.h>

    /* sum of the four neighbours plus the centre element */
    double update(double src[300][300], int x, int y) {
        return src[x][y - 1] + src[x][y + 1] +
               src[x - 1][y] + src[x + 1][y] + src[x][y];
    }

    void func(void) {
        int h = 300;
        int w = 300;
        int t = 0;
        double e = 90.0;
        double d = e;               /* anything >= e so the loop runs at least once */
        double data[300][300];      /* data and data2 are assumed to be             */
        double data2[300][300];     /* initialised elsewhere                        */

        while (d >= e) {
            d = 0.0;
            #pragma omp parallel for collapse(2) reduction(+:d) schedule(static, 5624)
            for (int y = 1; y < h - 1; y++) {
                for (int x = 1; x < w - 1; x++) {
                    if (t % 2 == 0) {
                        /* read from data2, write to data */
                        double o = data2[x][y];
                        double n = update(data2, x, y);
                        data[x][y] = n;
                        d += fabs(o - n);
                    } else {
                        /* read from data, write to data2 */
                        double o = data[x][y];
                        double n = update(data, x, y);
                        data2[x][y] = n;
                        d += fabs(o - n);
                    }
                }
            }
            t += 1;
        }
    }

This code works in parallel, but when the thread count is set fairly high, e.g. 16 (I am running an 8-core i9 CPU), the program runs slower than with only 8 or 4 threads. I assumed this might have to do with false sharing, so I tried to set up the scheduling myself. My logic was as follows:

300 * 300 = 90k; with 16 threads, 90k / 16 = 5625, which I would assume is the default static chunk size?

However, each element of the nested data array I write to is 8 bytes (a double). Assuming my cache line is 64 bytes, that means 8 elements fit on one cache line, and a cache line should not be shared between cores. Since 5625 % 8 is not 0, chunk boundaries fall in the middle of a cache line, so the lines would be shared? To fix this I simply subtracted 1 to get 5624.
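Spelling that reasoning out in code (just an illustration of the arithmetic, assuming a 64-byte cache line):

    #include <stdio.h>

    /* illustrates the chunk-size arithmetic above; 64-byte cache line assumed */
    int main(void) {
        int iterations = 300 * 300;                   /* 90000 collapsed iterations (as assumed above) */
        int chunk      = iterations / 16;             /* 5625, assumed default static chunk            */
        int per_line   = (int)(64 / sizeof(double));  /* 8 doubles per cache line                      */

        printf("5625 %% 8 = %d\n", chunk % per_line);        /* 1 -> chunk boundary mid cache line     */
        printf("5624 %% 8 = %d\n", (chunk - 1) % per_line);  /* 0 -> boundary on a cache-line edge     */
        return 0;
    }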

I guess this means one thread will be handed more than one chunk, but I would assume that is still better than sharing a cache line?

Anyway, the end result of all this is that the run time with shared cache lines (chunk size 5625) is 3.8464 seconds, and with unshared cache lines (5624) it is 3.8028 seconds. The difference is hardly staggering, so am I completely misunderstanding how all of this works?

Upvotes: 0

Views: 736

Answers (1)

Qubit

Reputation: 1255

False sharing is most likely not the issue here. Cache lines can only overlap where the index ranges of different threads meet, i.e. at the very last few indices that thread n handles and the first few of thread n+1. Not only does this represent a very small fraction of the indices, those indices are also unlikely to be touched at the same time: you would expect all threads to progress at a similar rate, and as long as the ranges are not too small this should practically never happen (i.e. thread n+1 should have dealt with its first indices long before thread n gets to the end of its own range).

This does bring us to one of the problems. Your problem is not particularly large: ~5600 iterations will take very little time on a single core (given how few operations each loop iteration performs), and the overhead of parallelisation might outweigh what you gain. One would hope the threads are created once before the while loop and then simply reused, but that would be a fairly aggressive optimisation and I cannot say whether it happens here (this may also depend on your compilation options, which you did not share; it would be useful to add them). You could try to do it yourself by moving the #pragma omp parallel directive outside the while loop and keeping only a #pragma omp for inside, along the lines of the sketch below.
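A rough sketch of what that could look like (untested, reusing the variable names from your code; the extra barrier and the single constructs keep the shared d and t consistent across iterations):

    #pragma omp parallel
    {
        while (d >= e) {                 /* every thread evaluates the same shared d */
            /* make sure no thread is still reading d from the previous
               iteration before it gets reset */
            #pragma omp barrier
            #pragma omp single
            d = 0.0;                     /* implicit barrier at the end of single */

            #pragma omp for collapse(2) reduction(+:d) schedule(static)
            for (int y = 1; y < h - 1; y++) {
                for (int x = 1; x < w - 1; x++) {
                    /* same loop body as in the question */
                }
            }                            /* implicit barrier: d is fully reduced here */

            #pragma omp single
            t += 1;                      /* again followed by an implicit barrier */
        }
    }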

Another reason you might see performance degrade at 16 threads is hyperthreading. Hyperthreading only increases performance when your threads frequently have to wait for something, such as cache misses. In your case that seems quite unlikely, and it is entirely possible that a single thread already utilises a core in full. In that case hyperthreading will hurt performance, because the OS will try to give each thread some time on the core, and those switches do not come free.
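If you want to test that, you can simply cap the thread count at the number of physical cores, either with OMP_NUM_THREADS=8 in the environment or with a call before the parallel region, for example:

    #include <omp.h>

    /* somewhere before the while loop in func():
       use the 8 physical cores rather than the 16 hardware threads */
    omp_set_num_threads(8);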

Lastly, since your operations are quite trivial, in a larger case you might run into a memory bandwidth bottleneck. However, that should not be the case here: two 300x300 arrays of doubles are only about 2 x 300 x 300 x 8 B, roughly 1.4 MB, small enough to sit in the cache as a whole.

The best way to be sure where the bottlenecks are, however, is to measure. You can use tools such as Intel VTune Amplifier to check how your code uses the available CPU resources; that way you can easily identify what your code spends most of its time on.

Upvotes: 0
