Kyle Newton

Reputation: 21

OpenMP: No Speedup in parallel workloads

I can't figure out what's going on with my fairly simple OpenMP-parallelized for loop. On the same input size, P=1 runs in ~50 seconds, but P=2 takes almost 300 seconds and P=4 runs in ~250 seconds.

Here's the parallelized loop:

double time = omp_get_wtime();

printf("Input Size: %d\n", n);

#pragma omp parallel for private(i) reduction(+:in)
for(i = 0; i < n; i++) {
    double x = (double)(rand() % 10000)/10000;
    double y = (double)(rand() % 10000)/10000;
    if(inCircle(x, y)) {
        in++;
    }
}

double ratio = (double)in/(double)n;
double est_pi = ratio * 4.0;
time = omp_get_wtime() - time;

Runtimes:

p=1, n=1073741824 - 52.764 seconds

p=2, n=1073741824 - 301.66 seconds

p=4, n=1073741824 - 274.784 seconds

p=8, n=1073741824 - 188.224 seconds

Running in an Ubuntu 20.04 VM with 8 cores of a Xeon 5650 and 16 GB of DDR3 ECC RAM, on top of a FreeNAS installation on a dual Xeon 5650 system with 70 GB of RAM.

Partial Solution:

The rand() call inside the loop causes the runtime to jump when running on multiple threads.

Upvotes: 1

Views: 157

Answers (1)

TrentP

Reputation: 4722

Since rand() uses state saved from the previous call to generate the next pseudo-random number, it can't safely run in multiple threads at the same time: multiple threads would be reading and writing the PRNG state concurrently.

POSIX states that rand() need not be thread-safe. This means your code might simply produce wrong results. Alternatively, the C library might wrap rand() in a mutex so that only one thread can call it at a time. That is what's happening here, and it slows the code down considerably: the threads spend nearly all their time contending for the rand() critical section, since nothing else they do takes any significant time.
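To see why the lock is unavoidable with shared state, here is a hypothetical sketch of what a thread-safe rand() might look like inside a C library (the names srand_locked/rand_locked and the LCG constants are illustrative, not glibc's actual implementation). Every thread must take the same mutex, so all calls are fully serialized:

```c
#include <pthread.h>

/* One global state word shared by all threads, guarded by a mutex. */
static unsigned long prng_state = 1;
static pthread_mutex_t prng_lock = PTHREAD_MUTEX_INITIALIZER;

void srand_locked(unsigned int seed)
{
    pthread_mutex_lock(&prng_lock);
    prng_state = seed;
    pthread_mutex_unlock(&prng_lock);
}

int rand_locked(void)
{
    pthread_mutex_lock(&prng_lock);   /* every thread contends here */
    /* Classic linear congruential generator step: the next value
     * depends entirely on the stored state, so the read-modify-write
     * must happen atomically. */
    prng_state = prng_state * 1103515245UL + 12345UL;
    int r = (int)((prng_state / 65536UL) % 32768UL);
    pthread_mutex_unlock(&prng_lock);
    return r;
}
```

With 8 threads hammering this in a tight loop, the lock traffic dominates, which matches the slowdown you measured.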

To solve this, try rand_r(), which has no shared state; instead, each caller passes a pointer to its own seed, which serves as the PRNG state.

Keep in mind that using the same seed for every thread will defeat the purpose of increasing the number of trials in your Monte Carlo simulation. Each thread would just use the exact same pseudo-random sequence. Try something like this:

unsigned int seed;
#pragma omp parallel private(seed)
{
    seed = omp_get_thread_num();
    #pragma omp for private(i) reduction(+:in)
    for(i = 0; i < n; i++) {
        double x = (double)(rand_r(&seed) % 10000)/10000;
        double y = (double)(rand_r(&seed) % 10000)/10000;
        if(inCircle(x, y)) {
            in++;
        }
    }
}

BTW, you might notice your estimate is off: x and y need to be uniformly distributed over [0, 1], and with % 10000 they are not.
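One common fix (a sketch, not the only option) is to divide the full rand_r() output by RAND_MAX + 1.0 rather than taking a small modulus. This gives full precision, avoids modulo bias (RAND_MAX + 1 is generally not a multiple of 10000), and maps onto [0, 1); for this estimator the missing endpoint is immaterial:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* Map rand_r() output onto [0, 1): divide by RAND_MAX + 1.0 instead
 * of taking % 10000, which would keep only 10000 distinct values and
 * introduce a slight bias. */
double uniform01(unsigned int *seed)
{
    return (double)rand_r(seed) / ((double)RAND_MAX + 1.0);
}
```

In the loop above you would then write x = uniform01(&seed); and y = uniform01(&seed);.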

Upvotes: 3

Related Questions