STao34

Reputation: 1

Why does "#pragma omp atomic" increase runtime when the number of OpenMP threads is 1?

I am learning how to use OpenMP in a C program. I noticed that "#pragma omp atomic" increases the runtime even when the number of threads is 1, while updating a 1D array. Here is my code:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <mpi.h>
#include <omp.h>

double fixwork(int a, int n) //n==L
{
    int j;
    double s, x, y;
    double t = 0;
    for (j = 0; j < n; j++)
    {
        s = 1.0 * j * a;
        x = (1.0 - cos(s)) / 2.0;
        y = 0.31415926 * x; 
        t += y;
    }

    return t;
}

int main(int argc, char* argv[])
{
    int n = 100000;
    int p = 1;
    int L = 2;
    int q = 100;
    int g = 7;
    int i, j, k;
    double v;

    int np, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
    double* u = (double*)calloc(n * g, sizeof(double));
    double* w = (double*)calloc(n * g, sizeof(double));
    
    double omptime1 = -MPI_Wtime();
#pragma omp parallel for private(k, j, v) num_threads(p)
    for (i = 0; i < n; i++)
    {
        k = i * (int)ceil(1.0 * (i % q) / q);
        for (j = 0; j < g; j++)
        {
            v = fixwork(i * g + j, L);
#pragma omp atomic 
            u[k] += v;
        }
    }
    omptime1 += MPI_Wtime();
    
    printf("\npragma time = %f", omptime1);
    MPI_Finalize();
    return 0;
}

I compiled this code with:

mpiicc -qopenmp atomictest.c -o atomic

With 1 OpenMP thread and 1 MPI process, the observed ratio time(with atomic)/time(without atomic) is ~1.28 (n=1e6), ~1.07 (n=1e7), and even larger for smaller n. This suggests that the atomic directive itself costs extra time, even when no other thread is competing for the update. What is the reason for this overhead? What is the difference between the machine operations generated for "omp atomic" and for C++ atomics? Thanks.

Upvotes: 0

Views: 228

Answers (1)

Laci

Reputation: 2818

It is partially answered here:

If you enable OpenMP, gcc has to generate different code that works for any number of threads that is only known at runtime..... The compiler has to use different atomic instructions that are likely more costly...
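
To make the "more costly atomic instructions" part concrete, here is a minimal sketch in C11 of what an atomic `+=` on a double roughly amounts to. As far as I know, x86-64 has no single instruction for an atomic floating-point add, so compilers typically lower the update to a compare-and-swap retry loop on the 64-bit pattern, and that loop runs even when only one thread exists. The helper names `plain_add`, `cas_add` and `read_double` are made up for illustration; this is an assumption about the typical lowering, not the exact code your compiler emits.

```c
#include <assert.h>     /* static_assert */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t),
              "sketch assumes a 64-bit double");

/* Plain update: load, add, store. Cheap, but unsafe with concurrent writers. */
static void plain_add(double *target, double v)
{
    *target += v;
}

/* Hypothetical helper: atomic "+=" on a double stored as its 64-bit pattern.
   The compare-and-swap retries until no other writer changed the value
   between our load and our store. */
static void cas_add(_Atomic uint64_t *target, double v)
{
    uint64_t old_bits = atomic_load_explicit(target, memory_order_relaxed);
    uint64_t new_bits;
    double old_val, new_val;

    do {
        memcpy(&old_val, &old_bits, sizeof old_val);   /* bits -> double */
        new_val = old_val + v;
        memcpy(&new_bits, &new_val, sizeof new_bits);  /* double -> bits */
        /* On failure, old_bits is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak_explicit(
                 target, &old_bits, new_bits,
                 memory_order_relaxed, memory_order_relaxed));
}

/* Hypothetical helper: read the stored bit pattern back as a double. */
static double read_double(_Atomic uint64_t *target)
{
    uint64_t bits = atomic_load_explicit(target, memory_order_relaxed);
    double val;
    memcpy(&val, &bits, sizeof val);
    return val;
}
```

Calling `cas_add(&cell, v)` in place of `plain_add(&x, v)` in a single-threaded loop gives the same result, but every update pays for the locked compare-and-swap, which would be consistent with the overhead you measured shrinking in relative terms as the rest of the work grows. The sketch uses relaxed ordering because, as I understand it, `omp atomic` without a `seq_cst` clause is relaxed, whereas C++ `std::atomic` operations default to sequentially consistent ordering.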

Upvotes: 1
