Luna Morrow

Reputation: 11

Overlaying OpenMP onto an MPI program slows down the region parallelised with OpenMP

I have a particle simulation in C which is split over 4 MPI processes and runs fast compared to the serial version. However, one region of my implementation has O(N^2) complexity: each particle must be compared against every other particle in that process, plus 'border' particles shared from other processes. My plan to speed this up was to parallelize the outer loop with #pragma omp parallel for, but every variation of OpenMP pragmas I have tried has resulted in a severe slowdown of my simulation, because the nested loop in question takes significantly longer.

I have tried adding schedule and reduction clauses when starting the parallel region, which didn't do much, and I have tried both 4 and 8 threads, which also didn't help. I have also tried a range of system sizes (to check whether the speedup 'kicks in' once the work outweighs the overhead; it does not) and compiler optimizations.
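To be concrete, a minimal sketch of the kind of variant I mean is below (the chunk size is arbitrary and the inner pair loop is elided; the real snippet is further down):

#pragma omp parallel for num_threads(4) schedule(dynamic, 16) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p += 2){
    double force_x = 0;   // declared inside the loop, so each thread gets its own copy
    double force_y = 0;
    // ... inner O(N) pair loop accumulating into force_x and force_y ...
    accelerations[p] = force_x;
    accelerations[p+1] = force_y;
}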

Timing reveals that one of my four MPI processes takes a very long time through this section, which slows down the whole simulation because all processes must synchronise at the end of each time-step before moving to the next. The work is distributed quite evenly, so no single process should take longer than the others.
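The section is timed with omp_get_wtime() (see calc_1/calc_2 in the snippet below). To compare ranks, the accumulated calc_time can be gathered onto rank 0, roughly like this sketch (the gather-and-print part is illustrative, not copied from my code):

double all_times[4];   // one slot per MPI rank (the job uses -n 4)
MPI_Gather(&calc_time, 1, MPI_DOUBLE, all_times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (my_rank == 0){
    for (int r = 0; r < 4; r++){
        printf("rank %d: pairwise-force section total = %f s\n", r, all_times[r]);
    }
}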

Code snippet (roughly copied, but I have made some parts pseudocode because their details are unimportant):

double calc_1 = omp_get_wtime();
#pragma omp parallel for num_threads(4) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p += 2){
    //Particle* particle_iter = &particles[0];
    // printf("p%d: I am thread %d\n", my_rank, omp_get_thread_num());
    // fflush(stdout);
    double force_x = 0;
    double force_y = 0;
    particle_one[0] = positions[p];
    particle_one[1] = positions[p+1];
    //#pragma omp parallel shared(force_x, force_y)
    // {
    if (particle_one is real){
        // for (int i = 0; i < (myNumParticles-8); i = i + 8){   
        //#pragma omp parallel for private(distances)
        for (int i = 0; i < myNumParticles; i += 2){
            if (i != p){
                // lj_counter2++;
                particle_two[0] = positions[i];
                particle_two[1] = positions[i+1];
                if (particle within distance cutoff){
                    force_function(); // modifies forces array
                    force_x += (forces[0]);
                    force_y += (forces[1]);
                } 
                distances[0] = 0;
                distances[1] = 0;
            }
        }
        // repeat comparison against border particles
        for (int j = 0; j < (num_particles_local); j += 2){
            particle_three[0] = myBorderParticles[j];
            particle_three[1] = myBorderParticles[j+1];
            if (particle is real){
                if (within distance cutoff) {
                    force_function(); // modifies forces array
                    force_x += (forces[0]);
                    force_y += (forces[1]);
                } 
                distances[0] = 0;
                distances[1] = 0;
            } else {
                // if particle with position {0, 0} found, you're at the end
                break;
            }
        }
    }
    accelerations[p] = force_x;
    accelerations[p+1] = force_y;
}
double calc_2 = omp_get_wtime();
calc_time += (calc_2-calc_1);

Timing results for a 3600-particle system:

Further timings for some smaller systems:

Particles    MPI    MPI + OpenMP
225          27     30
400          37     33
625          60     101
900          99     293
1225         153    490
1600         239    735

I am running across 4 nodes with 28+ cores each, and I am not requesting an unreasonable amount of memory. The job is batched with Slurm and run with mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt.
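For context, a minimal sketch of the kind of batch script used (the #SBATCH values here are illustrative placeholders, not the exact ones from my job):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1     # one MPI process per node, matching mpiexec -n 4
#SBATCH --cpus-per-task=4       # illustrative: cores reserved per MPI process for the OpenMP threads
#SBATCH --mem=4G                # illustrative placeholder

export OMP_NUM_THREADS=4        # matches num_threads(4) in the pragma

mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt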

Upvotes: 1

Views: 63

Answers (0)
