LMo

Reputation: 1

MPI_Wtime changes when using OpenMP

I am trying to run some code that uses MPI + OpenMP, but when I measure the time the program takes, I see a huge difference when I comment out the OpenMP part. I don't know if I am measuring the time incorrectly or if there is another problem.

This is my code:

#include <iostream>
#include <cstdlib>
#include <stdexcept>   // for std::logic_error
#include <omp.h>
#include <mpi.h>

int main(int argc, char *argv[]) {

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

if (provided != MPI_THREAD_MULTIPLE) {
    throw std::logic_error("Cannot provide threaded MPI\n");
}

try {

    omp_set_num_threads(28);
    
    int wr, ws;
    MPI_Comm_rank(MPI_COMM_WORLD, &wr);
    MPI_Comm_size(MPI_COMM_WORLD, &ws);

    MPI_Barrier(MPI_COMM_WORLD);
    double starttime = MPI_Wtime();

    #pragma omp parallel
    {

        #pragma omp for
        for (long long i = 0; i < 1400000000/ws; i++) {


        }

    }

    double endtime = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);

    if (wr == 0) {
        std::cout << "Total time " << endtime - starttime << " seconds\n";
    }

}

catch (std::exception& e) {

    std::cout << e.what() << std::endl;
    return 1;

}

MPI_Finalize();

return 0;
}

When I run it with the OpenMP part active, I get the following times for np = 1, 2, 4 respectively (where np is the value passed to mpirun -np # ./main): 0.0944848, 0.171973, 0.121193. However, when I comment out that part, I get times on the order of 10^-7 seconds.

Do you have any suggestions on where the problem is? Thanks!

Upvotes: 0

Views: 42

Answers (1)

Joachim

Reputation: 876

As some comments already pointed out, the compiler can optimize the whole loop away because it is empty and therefore dead code. Contrary to what some comments suggested, the compiler also optimizes the loop iterations away in the OpenMP case. What it does not optimize away is the code that creates the parallel region and distributes the iterations to the threads. If you change the loop count to 100, the timing does not change. The difference between execution with and without OpenMP is the time needed to create the threads for the parallel region.
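To see that the remaining time is really just the parallel-region startup, here is a minimal sketch (not part of the original answer) that times nothing but an empty parallel region:

#include <iostream>
#include <omp.h>

int main() {
  double start = omp_get_wtime();

  // Empty region: the only work is creating/waking the thread team.
  #pragma omp parallel
  {
  }

  double end = omp_get_wtime();
  std::cout << "Empty parallel region: " << end - start << " seconds\n";
  return 0;
}

Compiled with -fopenmp, this measures only the thread-team startup cost; without -fopenmp the pragma is ignored and essentially nothing is timed.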

If you add some code to the loop body that the compiler cannot optimize out, you see the expected timing results:

#include <iostream>
#include <mpi.h>

int main(int argc, char *argv[]) {

  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  if (provided != MPI_THREAD_MULTIPLE) {
    std::cerr << "Cannot provide threaded MPI\n" << std::endl;
    MPI_Abort(MPI_COMM_SELF, 1);
  }

  int wr, ws;
  MPI_Comm_rank(MPI_COMM_WORLD, &wr);
  MPI_Comm_size(MPI_COMM_WORLD, &ws);

  double starttime = MPI_Wtime();
  double total = 0;

#pragma omp parallel
  {
#pragma omp for reduction(+ : total)
    for (long long i = 0; i < 1400000000 / ws; i++) {
      total += .2 * i;
    }
  }

  double endtime = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);

  if (wr == 0) {
    std::cout << "Total time " << endtime - starttime << " seconds\n";
    std::cout << "Result: " << total << std::endl;
  }

  MPI_Finalize();

  return 0;
}

Since the OpenMP functions and the omp.h include are removed, the code can simply be compiled with and without the -fopenmp flag. The number of threads is selected with the OMP_NUM_THREADS environment variable:

$ mpicxx -O3 openmp-loop-time.cpp -o serial
$ mpirun -np 2 env OMP_NUM_THREADS=1 ./serial 
Total time 0.370651 seconds
Result: 4.9e+16
$ mpirun -np 2 env OMP_NUM_THREADS=24 ./serial 
Total time 0.369488 seconds
Result: 4.9e+16
$ mpicxx -fopenmp -O3 openmp-loop-time.cpp -o parallel
$ mpirun -np 2 env OMP_NUM_THREADS=24 ./parallel
Total time 0.0188746 seconds
Result: 4.9e+16
$ mpirun -np 2 env OMP_NUM_THREADS=1 ./parallel 
Total time 0.369074 seconds
Result: 4.9e+16

Here, execution with 24 threads takes about 1/20 of the single-threaded execution time. The serial time is the same with and without OpenMP.
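As a side note on the measurement itself (a sketch, not part of the answer): MPI_Wtime is per process, so each rank measures its own elapsed time and rank 0 only prints its own value. If you want the reported number to reflect the slowest rank, you can reduce the elapsed times with MPI_MAX:

#include <iostream>
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int wr;
  MPI_Comm_rank(MPI_COMM_WORLD, &wr);

  MPI_Barrier(MPI_COMM_WORLD);   // start all ranks together
  double start = MPI_Wtime();
  // ... timed work goes here ...
  double elapsed = MPI_Wtime() - start;

  // Report the slowest rank, so the printed time does not depend on rank 0 alone.
  double max_elapsed;
  MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (wr == 0)
    std::cout << "Max time across ranks: " << max_elapsed << " seconds\n";

  MPI_Finalize();
  return 0;
}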

Upvotes: 0
