Konstantin Isupov

Reputation: 199

OpenMP slows down computations

I'm trying to parallelize a simple loop using OpenMP. Below is my code:

#include <iostream>
#include <omp.h>
#include <time.h>
#define SIZE 10000000

// Converts a clock() interval to milliseconds.
float calculate_time(clock_t start, clock_t end) {
    return (float) ((end - start) / (double) CLOCKS_PER_SEC) * 1000;
}

void openmp_test(double * x, double * y, double * res, int threads){
    clock_t start, end;
    std::cout <<  std::endl << "OpenMP, " << threads << " threads" << std::endl;
    start = clock();
    #pragma omp parallel for num_threads(threads)
    for(int i = 0; i < SIZE; i++){
        res[i] = x[i] * y[i];
    }
    end = clock();

    for(int i = 1; i < SIZE; i++){
        res[0] += res[i];
    }
    std::cout << "time: " << calculate_time(start, end) << std::endl;
    std::cout << "result: " << res[0] << std::endl;
}

int main() {

    double *dbl_x = new double[SIZE];
    double *dbl_y = new double[SIZE];
    double *res = new double[SIZE];
    for(int i = 0; i < SIZE; i++){
        dbl_x[i] = i % 1000;
        dbl_y[i] = i % 1000;
    }

    openmp_test(dbl_x, dbl_y, res, 1);
    openmp_test(dbl_x, dbl_y, res, 1);
    openmp_test(dbl_x, dbl_y, res, 2);
    openmp_test(dbl_x, dbl_y, res, 4);
    openmp_test(dbl_x, dbl_y, res, 8);

    delete [] dbl_x;
    delete [] dbl_y;
    delete [] res;
    return 0;
}

I compile it as follows:

g++ -O3 -fopenmp main.cpp -o ompTest

However, after running the test on a Core i7, I get the following results:

OpenMP, 1 threads
time: 31.468
result: 3.32834e+12

OpenMP, 1 threads
time: 18.663
result: 3.32834e+12

OpenMP, 2 threads
time: 34.393
result: 3.32834e+12

OpenMP, 4 threads
time: 56.31
result: 3.32834e+12

OpenMP, 8 threads
time: 108.54
result: 3.32834e+12

I don't understand what I'm doing wrong. Why does OpenMP slow down the calculations?

Also, why is the first run significantly slower than the second (both with 1 OpenMP thread)?

My test environment: Core i7-4702MQ CPU @ 2.20GHz, Ubuntu 18.04.2 LTS, g++ 7.4.0.

Upvotes: 3

Views: 291

Answers (2)

John Bollinger

Reputation: 180048

There are at least two things going on here.

  1. clock() measures elapsed processor time, which can be seen as a measure of the amount of work performed, whereas you want to measure elapsed wall time. See OpenMP time and clock() calculates two different results. (A minimal demonstration follows this list.)

  2. Aggregate processor time should be higher in a parallel program than in a comparable serial program because parallelization adds overhead. The more threads, the more overhead, so speed improvement per added thread decreases with more threads, and can even become negative.
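To see both effects concretely, here is a minimal sketch (mine, not part of either program) that times the same CPU-bound loop with both clock() and OpenMP's own omp_get_wtime(); with four busy threads, clock() reports roughly four times the wall time:

#include <cstdio>
#include <ctime>
#include <omp.h>

int main() {
    clock_t c0 = clock();         // CPU time, summed over all threads of the process
    double  w0 = omp_get_wtime(); // wall-clock time

    double sum = 0.0;
    #pragma omp parallel for num_threads(4) reduction(+:sum)
    for (long i = 0; i < 200000000L; i++)
        sum += i * 0.5;           // CPU-bound work divided among 4 threads

    double  w1 = omp_get_wtime();
    clock_t c1 = clock();

    // clock() counts every thread's CPU time; the wall clock counts once.
    std::printf("clock():         %.0f ms\n", (c1 - c0) * 1000.0 / CLOCKS_PER_SEC);
    std::printf("omp_get_wtime(): %.0f ms (sum=%g)\n", (w1 - w0) * 1000.0, sum);
    return 0;
}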

Compare to this variation on your code, which implements a more appropriate method to measure elapsed wall time:

// Converts a timespec interval to milliseconds.
float calculate_time(struct timespec start, struct timespec end) {
    long long start_nanos = start.tv_sec * 1000000000LL + start.tv_nsec;
    long long end_nanos = end.tv_sec * 1000000000LL + end.tv_nsec;
    return (end_nanos - start_nanos) * 1e-6f;
}

void openmp_test(double * x, double * y, double * res, int threads){
    struct timespec start, end;
    std::cout <<  std::endl << "OpenMP, " << threads << " threads" << std::endl;
    clock_gettime(CLOCK_MONOTONIC, &start);

    #pragma omp parallel for num_threads(threads)
    for(int i = 0; i < SIZE; i++){
        res[i] = x[i] * y[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    for(int i = 1; i < SIZE; i++){
        res[0] += res[i];
    }
    std::cout << "time: " << calculate_time(start, end) << std::endl;
    std::cout << "result: " << res[0] << std::endl;
}

The results for me are

OpenMP, 1 threads
time: 92.5535
result: 3.32834e+12

OpenMP, 2 threads
time: 56.128
result: 3.32834e+12

OpenMP, 4 threads
time: 59.8112
result: 3.32834e+12

OpenMP, 8 threads
time: 78.9066
result: 3.32834e+12

Note how the measured time with two threads is cut about in half, but adding more cores doesn't improve things much, and eventually starts trending back toward the single-thread time.* This exhibits the competing effects of performing more work concurrently on my four-core, eight-hyperthread machine, and of the increased overhead and resource contention associated with having more threads to coordinate.

Bottom line: throwing more threads at the task does not necessarily get you the result faster, and it rarely gets you a speedup proportional to the number of threads.
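As a rough check on that claim, the speedups implied by the timings reported above can be computed directly; a small sketch:

#include <cstdio>

int main() {
    // Wall-clock times (ms) reported above, indexed by thread count.
    const int    threads[] = { 1, 2, 4, 8 };
    const double time_ms[] = { 92.5535, 56.128, 59.8112, 78.9066 };

    for (int i = 0; i < 4; i++) {
        double speedup    = time_ms[0] / time_ms[i];
        double efficiency = 100.0 * speedup / threads[i];
        std::printf("%d threads: speedup %.2fx, efficiency %.0f%%\n",
                    threads[i], speedup, efficiency);
    }
    return 0;
}

Two threads come in at about 82% parallel efficiency; eight threads drop to about 15%.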


* Full disclosure: I cherry-picked these particular results from among those of several runs. All showed similar trends, but the trend is particularly pronounced -- and therefore probably overemphasized -- in this one.

Upvotes: 3

Hume2

Reputation: 236

Currently you create the threads, but you give them all the same job.

I think you forgot the "for" in the pragma, which makes the threads divide the loop into parts:

    #pragma omp parallel for num_threads(threads)
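
To make the difference visible, here is a minimal sketch (not from the question) contrasting the two pragmas; without "for", every thread executes the whole loop, while "parallel for" splits the iterations among the threads:

#include <cstdio>
#include <omp.h>

int main() {
    // Without "for": the loop body is replicated, every thread runs all iterations.
    #pragma omp parallel num_threads(2)
    for (int i = 0; i < 2; i++)
        std::printf("parallel:     thread %d ran i=%d\n", omp_get_thread_num(), i);

    // With "for": the iterations are divided among the threads.
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 2; i++)
        std::printf("parallel for: thread %d ran i=%d\n", omp_get_thread_num(), i);
    return 0;
}

The first region prints four lines (two threads times two iterations each); the second prints only two.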

Upvotes: 3
