Reputation: 21
I'm running a small gravity simulation. In serial execution it takes a little more than 4 minutes; when I parallelize one loop inside it, the runtime increases to about 7 minutes, and if I try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations, but I think they don't matter. This is the 7-minute version, with some comments on where I wanted to add parallelization to loops. Thank you for helping me with my messy code.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define numb 1000

int main(){
    double pos[numb][3], a[numb][3], a_local[3], v[numb][3];
    memset(v, 0, numb*3*sizeof(double));
    double richtung[3];
    double t, deltat = 0.0, r12 = 0.0, endt = 10.;
    unsigned seed;
    int tcount = 0;

    /* random initial positions, one seed per thread */
    #pragma omp parallel private(seed) shared(pos)
    {
        seed = 25235 + 16*omp_get_thread_num();
        #pragma omp for
        for(int i = 0; i < numb; i++){
            for(int j = 0; j < 3; j++){
                pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
            }
        }
    }

    for(t = 0.; t < endt; t += deltat){
        printf("\r%le", t);
        tcount++;

        /* first half position update and velocity update */
        #pragma omp parallel for shared(pos,v)
        for(int id = 0; id < numb; id++){
            for(int l = 0; l < 3; l++){
                pos[id][l] = pos[id][l] + (0.5*deltat*v[id][l]);
                v[id][l] = v[id][l] + a[id][l]*(deltat);
            }
        }

        /* accelerations from all particle pairs */
        memset(a, 0, numb*3*sizeof(double));
        memset(a_local, 0, 3*sizeof(double));
        #pragma omp parallel for private(r12,richtung) shared(a,pos)
        for(int id = 0; id < numb; ++id){
            for(int id2 = 0; id2 < id; id2++){
                for(int k = 0; k < 3; k++){
                    r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
                }
                for(int k = 0; k < 3; k++){
                    richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
                    a[id][k] += richtung[k]/(((r12)*(r12)));
                    a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
                    /* update of the partner particle, serialized with a critical section */
                    #pragma omp critical
                    {
                        a[id2][k] += a_local[k];
                    }
                }
                r12 = 0.0;
            }
        }

        /* second half position update */
        #pragma omp parallel for shared(pos)
        for(int id = 0; id < numb; id++){
            for(int k = 0; k < 3; k++){
                pos[id][k] = pos[id][k] + (0.5*deltat*v[id][k]);
            }
        }
        deltat = 0.01;
    }
    return 0;
}
I'm compiling the code with
g++ -fopenmp -o test_grav test_grav.c
and I'm measuring the runtime in the shell with
time ./test_grav
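As a side note, here is a minimal sketch (not part of my actual program) of how the wall time could also be measured inside the code with omp_get_wtime(), independent of the shell:

#include <stdio.h>
#include <omp.h>

int main(void){
    double t0 = omp_get_wtime();   /* wall clock before the work */

    /* ... the whole time loop of the simulation would go here ... */

    double t1 = omp_get_wtime();   /* wall clock after the work */
    printf("elapsed: %f s\n", t1 - t0);
    return 0;
}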
When I used
omp_get_num_threads()
to get the number of threads, it reported 4, and top also shows more than 300% (sometimes ~380%) CPU usage.
An interesting detail: if I open the parallel region before the time loop (i.e. the outermost for loop) without any actual #pragma omp for, the result is equivalent to creating one parallel region for each of the three major loops (the second-outermost loops). So I think it is an optimization issue, but I don't know how to solve it. Can anyone help me?
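For reference, a minimal sketch of how I check the thread count (omp_get_num_threads() only reports the team size from inside a parallel region):

#include <stdio.h>
#include <omp.h>

int main(void){
    #pragma omp parallel
    {
        #pragma omp single   /* print once instead of once per thread */
        printf("%d threads\n", omp_get_num_threads());
    }
    return 0;
}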
Edit: I made the example verifiable and lowered numbers like numb to make it easier to test, but the problem still occurs, even when I remove the critical region as suggested by TheQuantumPhysicist, just not as severely.
Upvotes: 2
Views: 1105
Reputation: 26276
I believe that critical section is the cause of the problem. Consider taking all critical sections outside the parallelized loop and running them after the parallelization is over.
Try this:
#pragma omp parallel shared(a,pos)
{
    #pragma omp for private(id2,k,r12,richtung,a_local)
    for(id = 0; id < numb; ++id){
        for(id2 = 0; id2 < id; id2++){
            for(k = 0; k < 3; k++){
                r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
            }
            for(k = 0; k < 3; k++){
                richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
                /* epsilon: a small softening constant to avoid division by zero */
                a[id][k] += richtung[k]/(((r12)*(r12))+epsilon);
                a_local[k] += richtung[k]/(((r12)*(r12))+epsilon)*(-1.0);
            }
            r12 = 0.0;   /* reset for the next pair, as in the original loop */
        }
    }
}
/* apply the partner contributions serially, outside the parallel region */
for(id = 0; id < numb; ++id){
    for(id2 = 0; id2 < id; id2++){
        for(k = 0; k < 3; k++){
            a[id2][k] += a_local[k];
        }
    }
}
Critical sections lead to locking and blocking. If you can keep these sections sequential, outside the parallel loop, you'll gain a lot in performance.
Note that this is a syntactic solution; I don't know whether it works for your case. But to be clear: if every step in your series depends on the previous one, then parallelization is not a solution for you, at least not simple parallelization using OpenMP.
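If some update really has to stay inside the pair loop, a lighter-weight alternative to a critical section is an atomic on the single racy write. A minimal sketch (my own illustration, using a proper Euclidean r12, not your exact code):

#include <math.h>
#include <omp.h>
#define numb 1000

void forces(double pos[numb][3], double a[numb][3]){
    #pragma omp parallel for
    for(int id = 0; id < numb; ++id){
        for(int id2 = 0; id2 < id; id2++){
            double d[3], r12 = 0.0;
            for(int k = 0; k < 3; k++){
                d[k] = pos[id][k] - pos[id2][k];
                r12 += d[k]*d[k];
            }
            r12 = sqrt(r12);
            for(int k = 0; k < 3; k++){
                double f = (-1.e10)*d[k]/(r12*r12*r12);
                a[id][k] += f;       /* no race: a[id] is written only by this iteration */
                #pragma omp atomic
                a[id2][k] -= f;      /* races across threads, so use an atomic add */
            }
        }
    }
}

The atomic still costs something per update, but it only protects one memory operation instead of serializing a whole block, so it usually scales much better than a critical section.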
Upvotes: 1