Smidstrup

Reputation: 81

Red-Black Gauss-Seidel and OpenMP

I was trying to prove a point about OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance with OpenMP.

The Gauss-Seidel iteration is split into two separate sweeps (red and black), such that within each sweep every update can be performed in any order and there is no dependency between tasks. So in theory, no processor should ever have to wait for another process to perform any kind of synchronization.

The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower. For many other linear parallelized routines I obtain very good scaling, but this one is tricky.

My fear is that I am unable to "explain" to the compiler that the operations I perform on the array are thread-safe, so that it cannot be really effective.

See the example below.

Does anyone have a clue how to make this more effective with OpenMP?

void redBlackSmooth(std::vector<double> const & b,
                    std::vector<double> & x,
                    double h)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 1; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 2; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}

Addition: I have now also tried a raw-pointer implementation, and it behaves the same as the STL container version, so it can be ruled out that the problem is some pseudo-critical behaviour coming from the STL.
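
For illustration, such a raw-pointer variant might look like the sketch below (the __restrict qualifier is a non-standard but widely supported compiler extension; it is only an attempt at hinting that b and x do not alias, and the function name is just for this sketch):

void redBlackSmoothRaw(double const * __restrict b,
                       double * __restrict x,
                       double h, int N)
{
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);

    // Boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep: odd indices, each iteration writes a distinct x[i].
    #pragma omp parallel for
    for (int i = 1; i < N-1; i += 2)
    {
        double const sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep: even indices, reads the red values written above.
    #pragma omp parallel for
    for (int i = 2; i < N-1; i += 2)
    {
        double const sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}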

Upvotes: 4

Views: 5133

Answers (2)

DOOM

Reputation: 1244

I think the main problem is the type of array structure you are using. Let's compare results with vectors and arrays (arrays = C-arrays allocated with the new operator).

Vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to keep the runtime above 0.1 seconds.

Vector Time:    0.121007        Repeat: 1       MLUPS:  82.6399

Array Time:     0.164009        Repeat: 2       MLUPS:  121.945

MLUPS = ((N-2)*repeat/runtime)/1000000 (million lattice-point updates per second)

MFLOPS are misleading when it comes to grid calculations: a few changes in the basic equation can make the performance look high for the same runtime.

The modified code:

double my_redBlackSmooth(double *b, double* x, double h, int N)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    double runtime(0.0), wcs, wce;   // wall-clock start/end times
    int repeat = 1;
    // timing() and dummy() are benchmarking helpers (not part of the smoother itself).
    timing(&wcs);
    for(; runtime < 0.1; repeat*=2)
    {
        for(int r = 0; r < repeat; ++r)
        {
            // Red sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 1; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }
            // Black sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 2; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }
            //  cout << "In Array:      " << r << endl;
        }
        if(x[0] != 0) dummy(x[0]);
        timing(&wce);
        runtime = (wce-wcs);
    }
    //  cout << "Before division:   " << repeat << endl;
    repeat /= 2;
    cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
         << "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;

    return runtime;
}
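
The timing() and dummy() calls above are benchmarking helpers that are not shown. A minimal sketch of what they might look like, assuming timing() returns the wall-clock time in seconds and dummy() only exists to keep the compiler from optimizing the loops away:

#include <chrono>

// Hypothetical stand-ins for the helpers used in the benchmark above.
void timing(double *wc)
{
    // Wall-clock time in seconds since an arbitrary epoch.
    *wc = std::chrono::duration<double>(
              std::chrono::steady_clock::now().time_since_epoch()).count();
}

void dummy(double v)
{
    // Writing to a volatile keeps the value (and the work producing it) alive.
    static volatile double sink;
    sink = v;
}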

I didn't change anything in the code except the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).
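
A minimal sketch of such an aligned allocation (assuming a compiler that ships <immintrin.h>; 64 bytes is a typical cache-line size, and the surrounding function is only for illustration):

#include <immintrin.h>   // _mm_malloc / _mm_free

void runAligned(int N, double h)
{
    // Allocate b and x on 64-byte boundaries so blocks start on cache lines.
    double *b = static_cast<double*>(_mm_malloc(N * sizeof(double), 64));
    double *x = static_cast<double*>(_mm_malloc(N * sizeof(double), 64));

    // ... fill b and x, then e.g. my_redBlackSmooth(b, x, h, N) ...

    _mm_free(x);   // aligned allocations must be released with _mm_free
    _mm_free(b);
}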

Upvotes: 0

sbabbi

Reputation: 11181

First of all, make sure that the x vector is aligned on cache-line boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:

double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);

Second, you can try to assign more computation to each thread (this way you keep cache lines separated), but I suspect that OpenMP already does something like this under the hood, so it may be worthless with large N.

In my case this implementation is much faster when x is not cache-aligned.

const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base , k, i;

#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
    base = j * workGroupSize;
    for (k = 0; k < workGroupSize; k+=2)
    {
        // redSweep selects the colour: offset 1 for the red sweep, 0 for the black one.
        i = base + k + (redSweep ? 1 : 0);
        if ( i == 0 || i+1 == N) continue;
        sigma = -invh2* ( x[i-1] + x[i+1] );
        x[i] = ( h2/2.0 ) * ( b[i] - sigma );
    }
}
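
An alternative way to get similar chunking without restructuring the loop (standard OpenMP, not something from the code above) is a chunked static schedule; since i steps by 2, a chunk of C iterations spans 2*C doubles, so C = CACHE_LINE_SIZE / (2*sizeof(double)) covers one cache line per chunk:

const int chunk = static_cast<int>(CACHE_LINE_SIZE / (2 * sizeof(double)));

// Red sweep with contiguous per-thread chunks of roughly one cache line each.
#pragma omp parallel for schedule(static, chunk)
for (int i = 1; i < N-1; i += 2)
{
    double const sigma = -invh2*(x[i-1] + x[i+1]);
    x[i] = (h2/2.0)*(b[i] - sigma);
}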

In conclusion, you definitely have a problem of cache-line contention (threads fighting over the same cache lines), but given the way OpenMP works (sadly I am not familiar with it) it should be enough to work with properly allocated buffers.

Upvotes: 1
