Anu

Reputation: 1367

Strange behavior when mixing OpenMP with Open MPI

I have some code that is parallelized using OpenMP (on a for loop). I now want to repeat that work several times and use MPI to submit to a cluster of machines, while keeping all the intra-node parallelism in OpenMP.

When I only use OpenMP, I get the speed-up I expect (using twice the number of processors/cores finishes in half the time). When I add in MPI and submit with only one MPI process, I do not get this speed-up. I created a toy problem to check this, and it shows the same issue. Here is the code:

#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

#include <omp.h>


int main(int argc, char *argv[]) {
    int iam=0, np = 1;
    long i;
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    double t1 = MPI_Wtime();
    std::cout << "!!!Hello World!!!" << std::endl; // prints !!!Hello World!!!

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    int nThread = omp_get_num_procs();//omp_get_num_threads here returns 1??
    printf("nThread = %d\n", nThread);

    int *total = new int[nThread];
    for (int j=0;j<nThread;j++) {
        total[j]=0;
    }
#pragma omp parallel num_threads(nThread) default(shared) private(iam, i)
    {
        np = omp_get_num_threads();

#pragma omp for schedule(dynamic, 1)
        for (i=0; i<10000000; i++) {
            iam = omp_get_thread_num();
            total[iam]++;
        }
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
                iam, np, rank, numprocs,processor_name);
    }

    int grandTotal=0;
    for (int j=0;j<nThread;j++) {
        printf("Total=%d\n",total[j]);
        grandTotal += total[j];
    }
    printf("GrandTotal= %d\n", grandTotal);

    MPI_Finalize();

    double t2 = MPI_Wtime();

    printf("time elapsed with MPI clock=%f\n", t2-t1);
    return 0;
}

I am compiling with openmpi-1.8/bin/mpic++ and the -fopenmp flag. Here is my PBS script:

#PBS -l select=1:ncpus=12

setenv OMP_NUM_THREADS 12

/util/mpi/openmpi-1.8/bin/mpirun -np 1 -hostfile $PBS_NODEFILE --map-by node:pe=$OMP_NUM_THREADS /workspace/HelloWorldMPI/HelloWorldMPI

I have also tried with #PBS -l nodes=1:ppn=12 and get the same results.

When using half the cores, the program is actually faster (twice as fast!). When I reduce the number of cores, I change both ncpus and OMP_NUM_THREADS. I have tried increasing the actual work (adding 10^10 numbers instead of the 10^7 shown in the code). I have tried removing the printf statements, wondering if they were somehow slowing things down, but the problem remains. top shows that I am using all the CPUs (as set in ncpus) at close to 100%. If I submit with -np 2, it parallelizes beautifully across two machines, so MPI seems to be working as expected, but the OpenMP side is broken.

I am out of ideas now. Is there anything else I can try? What am I doing wrong?

Upvotes: 4

Views: 2275

Answers (1)

cartographer

Reputation: 505

I hate to say it, but there's a lot wrong and you should probably just familiarize yourself more with OpenMP and MPI. Nevertheless, I'll try to go through your code and point out the errors I saw.

double t1 = MPI_Wtime();

Starting out: calling MPI_Wtime() before MPI_Init() is undefined. I'll also add that if you want to do this sort of benchmark with MPI, a good idea is to put an MPI_Barrier() before the call to MPI_Wtime() so that all the tasks enter the timed section at the same time.
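
As a minimal sketch of that timing pattern (just an illustration; the variable names are mine, not from your program):

#include <cstdio>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);       // MPI_Wtime() is only defined after this

    MPI_Barrier(MPI_COMM_WORLD);  // line up all tasks before starting the clock
    double t_start = MPI_Wtime();

    // ... the work being timed ...

    MPI_Barrier(MPI_COMM_WORLD);  // wait for the slowest task before stopping it
    double t_end = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("elapsed = %f s\n", t_end - t_start);

    MPI_Finalize();               // and no MPI_Wtime() after this
    return 0;
}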

//omp_get_num_threads here returns 1??

The reason why omp_get_num_threads() returns 1 is that the call is not inside a parallel region; outside one, the team consists of just the initial thread.
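
A quick sketch of the difference (compile with -fopenmp):

#include <cstdio>
#include <omp.h>

int main()
{
    // Outside a parallel region the team is just the initial thread,
    // so this prints 1.
    printf("outside: %d\n", omp_get_num_threads());

    #pragma omp parallel
    {
        #pragma omp master
        printf("inside:  %d\n", omp_get_num_threads()); // the actual team size
    }
    return 0;
}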

#pragma omp parallel num_threads(nThread)

You set num_threads to nThread here, which, as Hristo Iliev mentioned, effectively ignores any input through the OMP_NUM_THREADS environment variable. You can usually just leave num_threads out and be fine for this sort of simplified problem.
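
In other words, a plain parallel region picks up the environment setting, along the lines of this sketch:

#include <cstdio>
#include <omp.h>

int main()
{
    // Run as e.g.: OMP_NUM_THREADS=12 ./a.out
    #pragma omp parallel   // no num_threads clause, so OMP_NUM_THREADS is honored
    {
        #pragma omp master
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}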

default(shared)

Variables declared outside a parallel region are shared inside it by default, so there's no reason to have default(shared) here.
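
For instance (a small sketch; the atomic is only there so the concurrent update of the shared variable is well defined):

#include <cstdio>
#include <omp.h>

int main()
{
    int counter = 0;          // declared outside the region

    #pragma omp parallel      // counter is shared even without default(shared)
    {
        #pragma omp atomic
        counter++;            // every thread updates the same variable
    }

    printf("threads that ran: %d\n", counter);
    return 0;
}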

private(iam, i)

I guess it's your coding style, but instead of making iam and i private, you could just declare them inside the parallel region, which automatically makes them private (and since you don't really use them outside of it, there's not much reason not to).
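
That is, something like this sketch (reusing the names from your code):

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel      // no private(iam, i) clause needed
    {
        int iam = omp_get_thread_num();   // declared inside: automatically private

        #pragma omp for
        for (long i = 0; i < 1000; i++)   // index declared in the for: also private
        { /* work */ }

        printf("thread %d done\n", iam);
    }
    return 0;
}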

#pragma omp for schedule(dynamic, 1)

Also, as Hristo Iliev mentioned, using schedule(dynamic, 1) for this problem set in particular is not the best of ideas: each thread requests a single iteration at a time, so you pay scheduling overhead on every one of the 10^7 iterations, each of which does virtually no work. Since the iterations are uniform and the total problem size is fixed, a static schedule is the natural fit.
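
Contrasting the two in a sketch (compile with -fopenmp; the reduction is only there to keep the loops race-free and stop the compiler from deleting them):

#include <cstdio>
#include <omp.h>

int main()
{
    long total = 0;

    // dynamic,1: each thread requests one iteration at a time,
    // paying synchronization overhead on every single iteration
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (long i = 0; i < 10000000; i++)
        total += i;

    // static: the iteration space is split into contiguous chunks once,
    // with no per-iteration bookkeeping; a better fit for uniform work
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < 10000000; i++)
        total += i;

    printf("total = %ld\n", total);
    return 0;
}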

int grandTotal=0;
for (int j=0;j<nThread;j++) {
    printf("Total=%d\n",total[j]);
    grandTotal += total[j];
}

Not necessarily an error, but your allocation of the total array and the summation at the end are better accomplished with OpenMP's reduction clause.
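
Applied to your counting loop, that might look like this sketch:

#include <cstdio>
#include <omp.h>

int main()
{
    int grandTotal = 0;

    // Each thread gets a private, zero-initialized copy of grandTotal;
    // the copies are summed into the original when the loop ends.
    #pragma omp parallel for reduction(+:grandTotal)
    for (long i = 0; i < 10000000; i++)
        grandTotal++;

    printf("GrandTotal= %d\n", grandTotal);   // no per-thread array needed
    return 0;
}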

double t2 = MPI_Wtime();

Similarly, calling MPI_Wtime() after you've called MPI_Finalize() is undefined, just like calling it before MPI_Init(), and should be avoided.

Note: if you are somewhat familiar with OpenMP, this is a good reference; basically everything I explained here about OpenMP is in there.

With that out of the way, I have to note that you didn't actually do anything with MPI besides output the rank and comm size. Which is to say, all the MPI tasks do a fixed amount of work each, regardless of the number of tasks. Since there's no decrease in work-per-task as the number of MPI tasks increases, you wouldn't expect any scaling, would you? (Note: this is actually what's called weak scaling, but since you have no communication via MPI, there's no reason to expect it not to scale perfectly.)

Here's your code rewritten with some of the changes I mentioned:

#include <iostream>
#include <cstdio>
#include <cstdlib>

#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_size,
        world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int name_len;
    char proc_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(proc_name, &name_len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();

    // we need to scale the work per task by the number of MPI tasks,
    // otherwise we actually do more work the more tasks we have
    const int n_iterations = 1e7 / world_size;

    // actually we also need some dummy data to add so the compiler doesn't just
    // optimize out the work loop with -O3 on
    int data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = rand() % 16;

    // reduction(+:total) means that all threads will make a private
    // copy of total at the beginning of this construct and then
    // do a reduction operation with the + operator at the end (aka sum them
    // all together)
    unsigned int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        // both of these calls will execute properly since we
        // are in an omp parallel region
        int n_threads = omp_get_num_threads(),
            thread_id = omp_get_thread_num();

        // note: this code will only execute on a single thread (per mpi task)
        #pragma omp master
        {
            printf("nThread = %d\n", n_threads);
        }

        #pragma omp for
        for (int i = 0; i < n_iterations; i++)
            total += data[i % 16];

        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
                thread_id, n_threads, world_rank, world_size, proc_name);
    }

    // do a reduction with MPI, otherwise the data we just calculated is useless
    unsigned int grand_total;
    MPI_Allreduce(&total, &grand_total, 1, MPI_UNSIGNED, MPI_SUM, MPI_COMM_WORLD);

    // another barrier to make sure we wait for the slowest task
    MPI_Barrier(MPI_COMM_WORLD);
    double t_end = MPI_Wtime();

    // output each task's local total (the thread-private copies
    // were already summed by the reduction at the end of the parallel region)
    printf("Task total = %u\n", total);

    // output the final results from a single task
    if (world_rank == 0)
    {
        printf("Grand Total = %d\n", grand_total);
        printf("Time elapsed with MPI clock = %f\n", t_end - t_start);
    }

    MPI_Finalize();
    return 0;
}

Another thing to note: my version of your code ran 22 times slower with schedule(dynamic, 1) added, just to show how much it can hurt performance when used incorrectly.

Unfortunately I'm not too familiar with PBS, as the clusters I use run SLURM, but an example sbatch file for a job running on 3 nodes, on a system with two 6-core processors per node, might look something like this:

#!/bin/bash
#SBATCH --job-name=TestOrSomething
#SBATCH --export=ALL
#SBATCH --partition=debug
#SBATCH --nodes=3
#SBATCH --ntasks-per-socket=1

# use 6 OpenMP threads per MPI task
export OMP_NUM_THREADS=6

# note that this will end up running 3 * (however many CPUs
#   are on a single node) MPI tasks, not just 3. Additionally,
#   the line below might use `mpirun` instead, depending on the
#   cluster
srun ./a.out

For fun, I also just ran my version on a cluster to test the scaling for MPI and OMP, and got the following (note the log scales):

[Figure: scaling (time) for the example code; it's basically perfect]

As you can see, it's basically perfect. Actually, 1-16 is 1 MPI task with 1-16 OMP threads, and 16-256 is 1-16 MPI tasks with 16 threads per task, so you can also see that there's no change in behavior between the MPI scaling and the OMP scaling.

Upvotes: 3
