Nithin Paul Mathew

Reputation: 33

Performance difference while using shared data-structure instead of private data-structure in OpenMP

I've been working on OpenMP and trying to figure out why there is a performance drop when keeping an array as shared instead of private. Any input would be helpful.

When the array is shared, the loop takes about 65 ms to run, while if it is made private it takes about 38 ms, on an Intel Xeon E5540 CPU. The following code was compiled on Ubuntu with GCC 4.4.3.

I don't think it's due to false sharing since only read operations are performed on the array elements.

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define PI 3.14159265
#define large 1000000

double e[large];

int main() {
    int i,j,k,m;
    struct timeval t1,t2;

    double elapsedtime;
    omp_set_num_threads(16);
    for(i=0;i<large;i++) {
        e[i]=rand();
    }

    gettimeofday(&t1, NULL);

    #pragma omp parallel for private(i) shared(e)   
//  #pragma omp parallel for private(i,e)


    for(i=0;i<large;i++) {
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
    }

    gettimeofday(&t2, NULL);
    elapsedtime = (t2.tv_sec*1000000 + t2.tv_usec) - (t1.tv_sec * 1000000 + t1.tv_usec);
    printf("%f ",elapsedtime/1000);
    return 0;
}
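A caveat about the benchmark itself: the fmodf results are never used, so nothing stops an optimizing compiler from simplifying the loop away (whether GCC actually does so depends on the flags), which would skew any comparison made at -O2/-O3. Below is a minimal sketch of the same measurement, not part of the original code, that folds the results into a sum so the work stays live and uses OpenMP's own timer; the name "sink" is purely illustrative.

//timing_sketch.cpp - a hedged variant of the benchmark above, not the original code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define PI 3.14159265
#define large 1000000

double e[large];

int main() {
    int i;
    for(i=0;i<large;i++) {
        e[i]=rand();
    }

    omp_set_num_threads(16);
    double sink = 0.0;                 // accumulates results so they are not dead code
    double start = omp_get_wtime();

    #pragma omp parallel for private(i) shared(e) reduction(+:sink)
    for(i=0;i<large;i++) {
        sink += fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
    }

    double end = omp_get_wtime();
    printf("%f ms (sink=%f)\n", (end - start)*1000.0, sink);
    return 0;
}

The reported time should match the gettimeofday numbers; the only point is that the compiler cannot remove the loop once its result feeds into sink.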

Upvotes: 3

Views: 516

Answers (1)

John_West

Reputation: 2399

I decided to get rid of the global variable. Here is your code, modified in several places.

//timings.cpp
#include <sys/time.h>
#include <cstdlib>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <unistd.h>

#define PI 3.14159265
#define large 100000

int main() {
    int i;
    timeval t1,t2;

    double elapsedtime;
    bool b=false;

    double e[large];
    double p[large];

    omp_set_num_threads(1);
    for(i=0;i<large;i++) {
        e[i]=9.0;
    }

   /* for(i=0;i<large;i++) {
       p[i]=9.0;
    }*/

     gettimeofday(&t1, NULL);
  #pragma omp parallel for firstprivate(b) private(i) shared(e)
  //#pragma omp parallel for firstprivate(b) private(e,i)
     for(i=0;i<large;i++) {
        if (!b)
        {
            printf("e[i]=%f, e address: %p, n=%d\n",e[i],&e,omp_get_thread_num());
            b=true;
        }
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
        fmodf((exp(log((sin(e[i]*PI/180)+cos((e[i]*2)*PI/180))*10))*PI),3.0);
    }

    gettimeofday(&t2, NULL);
    elapsedtime = (t2.tv_sec*1000000 + t2.tv_usec) - (t1.tv_sec * 1000000 + t1.tv_usec);
    printf("%f ",elapsedtime/1000);
    return 0;
}

We run it through the script "1.sh", which swaps the active "parallel for" pragma with sed, recompiles, and runs the binary ten times to measure the timings automatically:

#!/bin/bash
sed -i '/parallel/ s,#,//#,g' timings.cpp
sed -i '/parallel/ s,////#,#,g' timings.cpp
g++ -O0 -fopenmp timings.cpp -o timings
> time1.txt
for loopvar in {1..10}
do
if [ "$loopvar" -eq 1 ]
then
./timings >> time1.txt;
cat time1.txt;
echo;
else
./timings | tail -1 >> time1.txt;
fi
done
echo "---------"
echo "Total time:"
echo `tail -1 time1.txt | sed s/' '/'+'/g | sed s/$/0/ | bc -li | tail -1`/`tail -1 time1.txt| wc -w | sed s/$/.0/` | bc -li | tail -1

Here are the testing results (Intel Core 2 Duo E8300):

1) #pragma omp parallel for firstprivate(b) private(i) shared(e)

user@comp:~ ./1.sh
Total time:
152.96380000000000000000

We see some strange latencies here. Example output:

e[i]=9.000000, e address: 0x7fffb67c6960, n=0
e[i]=9.000000, e address: 0x7fffb67c6960, n=7
e[i]=9.000000, e address: 0x7fffb67c6960, n=8
//etc..

Note the address: it is the same for every thread (that is why it is called shared).

2) #pragma omp parallel for firstprivate(e,b) private(i)

user@comp:~ ./1.sh
Total time:
157.48220000000000000000

Here the data in e is copied to each thread (firstprivate). Example output:

e[i]=9.000000, e address: 0x7ff93c4238e0, n=1
e[i]=9.000000, e address: 0x7ff939c1e8e0, n=6
e[i]=9.000000, e address: 0x7ff93ac208e0, n=4

3) #pragma omp parallel for firstprivate(b) private(e,i)

Total time:
123.97110000000000000000

No copying of data, only allocation (the private copies are used uninitialized). Example output:

 e[i]=0.000000, e address: 0x7fca98bdb8e0, n=1
 e[i]=0.000000, e address: 0x7fffa2d10090, n=0
 e[i]=0.000000, e address: 0x7fca983da8e0, n=2

Here we have different addresses, but all values of e contain memory garbage (the zeros are likely due to mmap handing out zero-filled pages).
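As a side note, the three clauses can be illustrated in isolation. The sketch below is separate from the experiment above (the variable x and the file name are just for demonstration): shared lets every thread see the original variable at the same address, firstprivate gives each thread its own copy initialized from the original, and private gives each thread its own uninitialized copy.

//clauses.cpp - standalone illustration, compile with: g++ -fopenmp clauses.cpp
#include <stdio.h>
#include <omp.h>

int main() {
    int x = 42;

    // shared: both threads see the same x at the same address
    #pragma omp parallel num_threads(2) shared(x)
    printf("shared:       thread %d: x=%d at %p\n", omp_get_thread_num(), x, (void*)&x);

    // firstprivate: each thread gets its own copy, initialized to 42
    #pragma omp parallel num_threads(2) firstprivate(x)
    printf("firstprivate: thread %d: x=%d at %p\n", omp_get_thread_num(), x, (void*)&x);

    // private: each thread gets its own copy, left uninitialized
    // (reading it here is only done to show the indeterminate value)
    #pragma omp parallel num_threads(2) private(x)
    printf("private:      thread %d: x=%d at %p\n", omp_get_thread_num(), x, (void*)&x);

    return 0;
}

The shared block should print one address for both threads, while the other two should print per-thread addresses, mirroring the outputs above.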

To see that firstprivate(e) is slower because of the array copying, let's comment out all the calculations (the lines with "fmodf").

// #pragma omp parallel for firstprivate(b) private(i) shared(e)

Total time:
9.69700000000000000000

// #pragma omp parallel for firstprivate(e,b) private(i)

Total time:
12.83000000000000000000

// #pragma omp parallel for firstprivate(b) private(i,e)

Total time:
9.34880000000000000000

So firstprivate(e) is slow because of the array copy, while shared(e) is slow because of the calculation lines.

Compiling with -O3 -ftree-vectorize slightly decreases the time of the shared version:

// #pragma omp parallel for firstprivate(b) private(i) shared(e)

user@comp:~ ./1.sh
Total time:
141.38330000000000000000

// #pragma omp parallel for firstprivate(b) private(e,i)

Total time:
121.80390000000000000000

Using schedule(static, 256) doesn't do the trick.
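(For reference, the clause was presumably added to the pragma like this; the line below is a reconstruction, not a copy of the tested code.)

#pragma omp parallel for firstprivate(b) private(i) shared(e) schedule(static, 256)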

Let's continue with the -O0 option. Comment out the array filling: // e[i]=9.0;

// #pragma omp parallel for firstprivate(b) private(i) shared(e)

Total time:
121.40780000000000000000

// #pragma omp parallel for firstprivate(b) private(e,i)

Total time:
122.33990000000000000000

So, "shared" is slower because of "private" data were used uninitialized (as proposed by commenters).

Let's see the dependence on thread number:

4 threads
shared
Total time:
156.95030000000000000000
private
Total time:
121.11390000000000000000

2 threads
shared
Total time:
155.96970000000000000000
private
Total time:
126.62130000000000000000

1 thread (performance drops roughly by half; I have a 2-core machine)
shared
Total time:
283.06280000000000000000
private
Total time:
229.37680000000000000000

To compile the no-parallel version with 1.sh, I manually uncommented both "parallel for" lines so that 1.sh would comment both of them out.

**1 thread without parallel, initialized e[i]**
Total time:
281.22040000000000000000

**1 thread without parallel, uninitialized e[i]**
Total time:
231.66060000000000000000

So it's not an OpenMP issue but a memory/cache usage issue. Generating the asm code with

g++ -O0 -S timings.cpp

in both cases gives two differences: one, which can be neglected, in the LC label numbering; the other is that one label (L3) contains not 1 but 5 asm lines when the e array is initialized:

L3:
movl    -800060(%rbp), %eax
movslq  %eax, %rdx
movabsq $4621256167635550208, %rax
movq    %rax, -800016(%rbp,%rdx,8)

(this is where the initialization occurs; the movabsq constant 4621256167635550208 is the bit pattern of the double 9.0), plus the line common to both versions: addl $1, -800060(%rbp)

So it seems like a cache issue.

This is not a complete answer, but you can use the code above to study the problem further.

Upvotes: 2
