Reputation: 11
I have a particle simulation in C which is split over 4 MPI processes and runs fast (compared to serial). However, one region of my implementation has N^2 complexity, where I need to compare each particle against every other particle in that process, plus 'border' particles shared from other processes. My plan to speed it up was to parallelize the outer loop with #pragma omp parallel for, but every variation of OpenMP pragmas I have tried has resulted in a severe slowdown of my simulation, because the nested loop in question takes significantly longer.
I have tried using schedule and reduction clauses when starting the parallel region, which didn't do much. I have also tried 8 and 4 threads, which didn't help either, plus a range of system sizes (to check whether the speedup 'kicked in' once the work outweighed the overhead; it has not) and compiler optimizations.
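For reference, the kind of variants I tried looked roughly along these lines; this is a self-contained sketch, not my real code, and the chunk size, the total_potential reduction variable and the placeholder force terms are illustrative only:

    /* Minimal sketch of the pragma variants I tried: a schedule() clause on the
     * outer loop and a reduction on a global accumulator. The chunk size, the
     * total_potential variable and the placeholder force terms are illustrative,
     * not my real code. */
    #include <omp.h>

    void sweep_sketch(int n, const double *positions, double *accelerations)
    {
        double total_potential = 0.0;   /* placeholder reduction target */

        #pragma omp parallel for num_threads(4) schedule(dynamic, 16) \
                reduction(+:total_potential)
        for (int p = 0; p < n; p += 2) {
            double force_x = 0.0, force_y = 0.0;
            for (int i = 0; i < n; i += 2) {
                if (i == p) continue;
                /* cutoff test and the real force evaluation go here */
                force_x += positions[i]     - positions[p];      /* placeholder */
                force_y += positions[i + 1] - positions[p + 1];  /* placeholder */
            }
            accelerations[p]     = force_x;
            accelerations[p + 1] = force_y;
            total_potential += force_x * force_x + force_y * force_y; /* placeholder */
        }
    }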
Timing reveals that one of my four MPI processes takes a very long time through this section, which slows down the whole simulation, since the processes have to wait for each other at the end of every time step before moving on to the next. The work is quite evenly distributed, so none of the processes should take noticeably longer than the others.
Code snippet (roughly copied, but I have made some aspects pseudocode as their details are unimportant):
double calc_1 = omp_get_wtime();

#pragma omp parallel for num_threads(4) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p = p + 2) {
    // Particle* particle_iter = &particles[0];
    // printf("p%d: I am thread %d\n", my_rank, omp_get_thread_num());
    // fflush(stdout);
    double force_x = 0;
    double force_y = 0;
    particle_one[0] = positions[p];
    particle_one[1] = positions[p+1];

    // #pragma omp parallel shared(force_x, force_y)   // another variant I tried
    // {
    if (/* particle_one is real */) {
        // for (int i = 0; i < (myNumParticles-8); i = i + 8) {
        // #pragma omp parallel for private(distances)  // variant: parallelize the inner loop
        for (int i = 0; i < myNumParticles; i += 2) {
            if (i != p) {
                // lj_counter2++;
                particle_two[0] = positions[i];
                particle_two[1] = positions[i+1];
                if (/* particle within distance cutoff */) {
                    force_function();   // modifies the forces array
                    force_x += forces[0];
                    force_y += forces[1];
                }
                distances[0] = 0;
                distances[1] = 0;
            }
        }

        // repeat the comparison against border particles shared from other processes
        for (int j = 0; j < num_particles_local; j += 2) {
            particle_three[0] = myBorderParticles[j];
            particle_three[1] = myBorderParticles[j+1];
            if (/* particle is real */) {
                if (/* within distance cutoff */) {
                    force_function();   // modifies the forces array
                    force_x += forces[0];
                    force_y += forces[1];
                }
                distances[0] = 0;
                distances[1] = 0;
            } else {
                // a particle at position {0, 0} marks the end of the border list
                break;
            }
        }
    }
    accelerations[p]   = force_x;
    accelerations[p+1] = force_y;
}

double calc_2 = omp_get_wtime();
calc_time += (calc_2 - calc_1);
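The calc_time accumulated above is what I compare between ranks; a rough sketch of how such a comparison can be done is below (illustrative only, not copied from my code):

    /* Illustrative only (not copied from my code): gather the per-rank time
     * spent in the nested loop and report the spread on rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    void report_calc_time(double calc_time, int my_rank)
    {
        double max_time = 0.0, min_time = 0.0;
        MPI_Reduce(&calc_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&calc_time, &min_time, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        if (my_rank == 0) {
            printf("nested loop: slowest rank %.2f s, fastest rank %.2f s\n",
                   max_time, min_time);
        }
    }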
Timing results for a 3600-particle system:

MPI only: 373.6 seconds (of which 332.47-349.08 s is spent in the above nested loop)

MPI plus OpenMP: 1606.2 seconds (of which 583.09-1579.96 s is spent in the nested loop, depending on the process)
Further timings for some smaller systems are below (all times in seconds):
| Particles | MPI | MPI + OpenMP |
|---|---|---|
| 225 | 27 | 30 |
| 400 | 37 | 33 |
| 625 | 60 | 101 |
| 900 | 99 | 293 |
| 1225 | 153 | 490 |
| 1600 | 239 | 735 |
I am running it across 4 nodes with 28+ cores each, and I'm not requesting an unreasonable amount of memory. It is batched with Slurm and run with `mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt`.
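If it's relevant, I can confirm what each rank actually sees inside the job with a start-up check along these lines (a sketch, not currently part of my code; MPI_THREAD_FUNNELED is just an example thread level):

    /* Sketch of a start-up check (not in my current code) to confirm the
     * provided MPI thread level and the number of OpenMP threads per rank. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, my_rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        printf("rank %d: provided thread level %d, omp_get_max_threads() = %d, "
               "omp_get_num_procs() = %d\n",
               my_rank, provided, omp_get_max_threads(), omp_get_num_procs());

        /* ... rest of the simulation ... */

        MPI_Finalize();
        return 0;
    }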
Upvotes: 1
Views: 63