Sarah
Sarah

Reputation: 133

OpenMP For Loop gets slow by increasing threads

I have a simple for loop over an array. It gets slow when I am using more processors. Here is the code:

#include <omp.h>
#include <sys/time.h>
#include <iostream>
#include <vector>
#include <fstream>

using namespace std;

int main(int argc, char* argv[])
{
    string nth;
    if(argc<2)
    {
         cout << "Not enough parameters have been passed. \n";
         cin.get();
         exit(0);
    }
    else
    {
       nth=argv[1];
    }

    N=1000;
    vector<vector< int> > I;
    int *array= new int[N];
    // Initialize I and array

    struct timeval time_start;
    gettimeofday(&time_start, NULL);
    for (int y=0; y<I.size(); y++) {
        int i= I[y][0];
        int j= I[y][1];
        if (array[i]!=array[j]) {
            int a=array[i];
            int b=array[j];
            int min=min(a,b);

        #pragma omp parallel for shared (a,b,min)
            for (int n=0; n<N; n++)
            {
                if (array[n]==a || array[n]==b) {
                    array[n]=min;
                }
            }
        }
}

    struct timeval time_end;
    gettimeofday(&time_end, NULL);
    double sectiontime = (time_end.tv_sec * 1000000 + time_end.tv_usec) - (time_start.tv_sec * 1000000 + time_start.tv_usec);
    cout<<"Section Time: "<<sectiontime<<endl;
    delete array;
    I.clear();
    return 0;
}

I compile it as:

g++ test.cpp -fopenmp -o outTestPar -std=c++0x

and run it by:

./outTestPar 2

I run it on a machine with 64 cores. I get this as results:

With 2 processor:

[...]$ ./outTestPar 2
Section Time: 28003
[...]$ ./outTestPar 2
Section Time: 20897
[...]$ ./outTestPar 2
Section Time: 19506
[...]$ ./outTestPar 2
Section Time: 22990

With 4 processor:

[...]$ ./outTestPar 4
Section Time: 20362
[...]$ ./outTestPar 4
Section Time: 19963
[...]$ ./outTestPar 4
Section Time: 28147
[...]$ ./outTestPar 4
Section Time: 20857

With 8 processor:

[...]$ ./outTestPar 8
Section Time: 24881
[...]$ ./outTestPar 8
Section Time: 28056

With 16 processor:

[...]$ ./outTestPar 16
Section Time: 24332
[...]$ ./outTestPar 16
Section Time: 26921

With 32 processor:

[...]$ ./outTestPar 32
Section Time: 21858
[...]$ ./outTestPar 32
Section Time: 23367
[...]$ ./outTestPar 32
Section Time: 25200
[...]$ ./outTestPar 32
Section Time: 24813

As you can see, not only is there no improvement, sometimes it gets worse. Any idea what's going on? How can I improve it? I also tried different schedules (static, dynamic, guided). Didn't work and made it worse.

Upvotes: 0

Views: 284

Answers (1)

1201ProgramAlarm
1201ProgramAlarm

Reputation: 32727

The loop you are parallelizing is very small and memory bound. It doesn't take too many cores to saturate memory, and after that performance will fall off.

You can also have issues with the cache if different threads write to the same cache line. The details of this vary with the hardware, but generally if one thread writes a value, the entire cache line will be invalidated for the other threads causing them to reload the line if they are still using it.

One way to reduce the impact of memory writes is to only write when the value changes. So rather than checking for "a" or "b", then writing the minimum of the two values, have your if statement check for the larger of the two values, and replace it with the smaller value. This will reduce the number of writes to memory by not writing out values that don't change.

Upvotes: 1

Related Questions