manywows

Reputation: 195

Parallelize a matrix with OpenMP to avoid false sharing

I am attempting to optimize a parallel algorithm as such:

    #include <math.h>

    /* sum of the four neighbours plus the centre element */
    double update(double src[300][300], int x, int y) {
        return src[x][y - 1] + src[x][y + 1] +
               src[x - 1][y] + src[x + 1][y] + src[x][y];
    }

    void func(void) {
        int h = 300;
        int w = 300;
        int t = 0;
        double e = 90.0;
        double d = e;               /* anything >= e so the loop runs at least once */
        double data[300][300];      /* data and data2 are assumed to be             */
        double data2[300][300];     /* initialised elsewhere                        */

        while (d >= e) {
            d = 0.0;
            #pragma omp parallel for collapse(2) reduction(+:d) schedule(static, 5624)
            for (int y = 1; y < h - 1; y++) {
                for (int x = 1; x < w - 1; x++) {
                    if (t % 2 == 0) {
                        /* read from data2, write to data */
                        double o = data2[x][y];
                        double n = update(data2, x, y);
                        data[x][y] = n;
                        d += fabs(o - n);
                    } else {
                        /* read from data, write to data2 */
                        double o = data[x][y];
                        double n = update(data, x, y);
                        data2[x][y] = n;
                        d += fabs(o - n);
                    }
                }
            }
            t += 1;
        }
    }

This code works in parallel, but when the thread count is set fairly high, e.g. 16 (I am running an 8-core i9 CPU), the program runs slower than with only 8 or 4 threads. I assumed this might have to do with false sharing, so I tried to set up the scheduling myself. My logic was as follows:

300 * 300 = 90k; with 16 threads, 90k / 16 = 5625, which I would assume is the default static chunk size?

However, each element of the nested data array I write to is 8 bytes (a double). Assuming my cache line is 64 bytes, that means 8 elements fit on one cache line, and a cache line should not be shared between cores. Since 5625 % 8 is not 0, chunk boundaries fall in the middle of a cache line, so the lines would be shared? To fix this I simply subtracted 1 to get 5624.
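Spelling that reasoning out in code (just an illustration of the arithmetic, assuming a 64-byte cache line):

    #include <stdio.h>

    /* illustrates the chunk-size arithmetic above; 64-byte cache line assumed */
    int main(void) {
        int iterations = 300 * 300;                   /* 90000 collapsed iterations (as assumed above) */
        int chunk      = iterations / 16;             /* 5625, assumed default static chunk            */
        int per_line   = (int)(64 / sizeof(double));  /* 8 doubles per cache line                      */

        printf("5625 %% 8 = %d\n", chunk % per_line);        /* 1 -> chunk boundary mid cache line     */
        printf("5624 %% 8 = %d\n", (chunk - 1) % per_line);  /* 0 -> boundary on a cache-line edge     */
        return 0;
    }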

I guess this means one thread will be handed more than one chunk, but I would assume that is still better than sharing a cache line?

Anyway, the end result of all this is that the run time with shared cache lines (chunk size 5625) is 3.8464 seconds, and with unshared cache lines (5624) it is 3.8028 seconds. The difference is hardly staggering, so am I completely misunderstanding how all of this works?

Upvotes: 0

Views: 736

Answers (1)

Qubit

Reputation: 1255

False sharing is most likely not the issue here. Cache lines can only overlap where the index ranges of different threads meet, i.e. at the very last few indices that thread n handles and the first few of thread n+1. Not only does this represent a very small fraction of the indices, those indices are also unlikely to be touched at the same time: you would expect all threads to progress at a similar rate, and as long as the ranges are not too small this should practically never happen (i.e. thread n+1 should have dealt with its first indices long before thread n gets to the end of its own range).

This does bring us to one of the problems. Your problem is not particularly large: ~5600 iterations will take very little time on a single core (given how few operations each loop iteration performs), and the overhead of parallelisation might outweigh what you gain. One would hope the threads are created once before the while loop and then simply reused, but that would be a fairly aggressive optimisation and I cannot say whether it happens here (this may also depend on your compilation options, which you did not share; it would be useful to add them). You could try to do it yourself by moving the #pragma omp parallel directive outside the while loop and keeping only a #pragma omp for inside, along the lines of the sketch below.
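A rough sketch of what that could look like (untested, reusing the variable names from your code; the extra barrier and the single constructs keep the shared d and t consistent across iterations):

    #pragma omp parallel
    {
        while (d >= e) {                 /* every thread evaluates the same shared d */
            /* make sure no thread is still reading d from the previous
               iteration before it gets reset */
            #pragma omp barrier
            #pragma omp single
            d = 0.0;                     /* implicit barrier at the end of single */

            #pragma omp for collapse(2) reduction(+:d) schedule(static)
            for (int y = 1; y < h - 1; y++) {
                for (int x = 1; x < w - 1; x++) {
                    /* same loop body as in the question */
                }
            }                            /* implicit barrier: d is fully reduced here */

            #pragma omp single
            t += 1;                      /* again followed by an implicit barrier */
        }
    }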

Another reason you might see performance degrade at 16 threads is hyperthreading. Hyperthreading only increases performance when your threads frequently have to wait for something, such as cache misses. In your case that seems quite unlikely, and it is entirely possible that a single thread already utilises a core in full. In that case hyperthreading will hurt performance, because the OS will try to give each thread some time on the core, and those switches do not come free.
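If you want to test that, you can simply cap the thread count at the number of physical cores, either with OMP_NUM_THREADS=8 in the environment or with a call before the parallel region, for example:

    #include <omp.h>

    /* somewhere before the while loop in func():
       use the 8 physical cores rather than the 16 hardware threads */
    omp_set_num_threads(8);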

Lastly, since your operations are quite trivial, in a larger case you might run into a memory bandwidth bottleneck. However, that should not be the case here: two 300x300 arrays of doubles are only about 2 x 300 x 300 x 8 B, roughly 1.4 MB, small enough to sit in the cache as a whole.

The best way to be sure where the bottlenecks are, however, is to measure. You can use tools such as Intel VTune Amplifier to check how your code uses the available CPU resources; that way you can easily identify what your code spends most of its time on.

Upvotes: 0
