Caco

Reputation: 1654

Increasing the number of threads increases the time instead of decreasing it

I'm a newbie to OpenMP, starting with the tutorial from the official OpenMP page: https://www.youtube.com/playlist?list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG

In that tutorial there is a hello-world program that calculates pi by approximating an integral. I simply wrote the code below following the instructions, but its execution time increases as I increase the number of threads via NUM_THREADS. In the video, the time goes down.

I'm executing the program on a remote server with 64 CPUs, each having 8 cores.

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000;
double step;

#define NUM_THREADS 2 

int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    double start_t;

    step = 1.0 / (double) num_steps;

    omp_set_num_threads(NUM_THREADS);

    start_t = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;

        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) {
        pi += sum[i] * step;
    }
    printf("%f\n", omp_get_wtime() - start_t);
}

Upvotes: 0

Views: 688

Answers (1)

Hristo Iliev

Reputation: 74455

This is a bad approach to implementing reduction using shared arrays. The successive elements of sum are too close to each other and therefore reside in the same cache line. On cache-coherent architectures like x86/x64, this leads to a problem known as false sharing. The following simple modification will get rid of it:

double sum[8*NUM_THREADS];

#pragma omp parallel
{
    ...
    for (i = id, sum[8*id] = 0.0; i < num_steps; i = i + nthrds) {
        ...
        sum[8*id] += 4.0 / (1.0 + x*x);
    }
}
for (i = 0, pi = 0.0; i < nthreads; i++) {
    pi += sum[8*i] * step;
}

Only the relevant changes are shown. The idea is simple: instead of having threads access successive elements of sum, make them access every 8th element. This guarantees that threads do not share a cache line, since on most modern CPUs a cache line is 64 bytes long, which corresponds to 64 / sizeof(double) = 8 array elements.

Edit: my mistake, I should have watched the video first. False sharing is explained right after the results from running the code are shown. If you don't see any speedup in your case, it's probably because newer CPU generations handle false sharing better.

Upvotes: 3
