Reputation: 11
I've encountered an interesting phenomenon that I can't explain. I haven't found an answer online, as most posts deal with weak scaling and thus communication overhead.
Here is a small piece of code to illustrate the problem. It has been tested in different languages with similar results, hence the multiple tags.
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;
    MPI_Barrier(MPI_COMM_WORLD);
    t = clock();                    /* time only the empty nested loop */

    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            /* nothing */
        }
    }

    t = clock() - t;
    printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);

    MPI_Finalize();
    return 0;
}
Now, as you can see, the only part that is timed is the loop. Therefore, with identical cores, no hyperthreading, and sufficient RAM, increasing the number of processes should produce exactly the same time.
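For reference, clock() measures the CPU time of the process rather than elapsed time; a minimal sketch of the same program that records both (only the standard MPI_Wtime call is assumed beyond the code above) would look like this:

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    MPI_Barrier(MPI_COMM_WORLD);
    clock_t c0 = clock();          /* CPU time of this process */
    double  w0 = MPI_Wtime();      /* wall-clock (elapsed) time */

    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            /* nothing */
        }
    }

    printf("proc %d: cpu %f s, wall %f s\n", wrank,
           (double)(clock() - c0) / CLOCKS_PER_SEC,
           MPI_Wtime() - w0);

    MPI_Finalize();
    return 0;
}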
However, on my machine, which has 32 cores and 15 GiB of RAM,
mpirun -np 1 ./test
gives
proc 0 took 22.262777 seconds.
but
mpirun -np 20 ./test
gives
proc 18 took 24.440767 seconds.
proc 0 took 24.454365 seconds.
proc 4 took 24.461191 seconds.
proc 15 took 24.467632 seconds.
proc 14 took 24.469728 seconds.
proc 7 took 24.469809 seconds.
proc 5 took 24.461639 seconds.
proc 11 took 24.484224 seconds.
proc 9 took 24.491638 seconds.
proc 2 took 24.484953 seconds.
proc 17 took 24.490984 seconds.
proc 16 took 24.502146 seconds.
proc 3 took 24.513380 seconds.
proc 1 took 24.541555 seconds.
proc 8 took 24.539808 seconds.
proc 13 took 24.540005 seconds.
proc 12 took 24.556068 seconds.
proc 10 took 24.528328 seconds.
proc 19 took 24.585297 seconds.
proc 6 took 24.611254 seconds.
and, for other process counts, the times fall somewhere in between.
htop also shows an increase in RAM consumption (VIRT is ~100M for 1 core and ~300M for 20 cores), although that might just be related to the size of the MPI communicator?
Finally, the effect is definitely related to the size of the problem (and thus not communication overhead, which would cause a constant delay regardless of the size of the loop). Indeed, decreasing imax to, say, 10 000 makes the wall times similar:
1 core :
proc 0 took 0.028439 seconds.
20 cores :
proc 1 took 0.027880 seconds.
proc 12 took 0.027880 seconds.
proc 8 took 0.028024 seconds.
proc 16 took 0.028135 seconds.
proc 17 took 0.028094 seconds.
proc 19 took 0.028098 seconds.
proc 7 took 0.028265 seconds.
proc 9 took 0.028051 seconds.
proc 13 took 0.028259 seconds.
proc 18 took 0.028274 seconds.
proc 5 took 0.028087 seconds.
proc 6 took 0.028032 seconds.
proc 14 took 0.028385 seconds.
proc 15 took 0.028429 seconds.
proc 0 took 0.028379 seconds.
proc 2 took 0.028367 seconds.
proc 3 took 0.028291 seconds.
proc 4 took 0.028419 seconds.
proc 10 took 0.028419 seconds.
proc 11 took 0.028404 seconds.
This has been tried on several machines with similar results. Maybe we're missing something very simple.
Thanks for the help!
Upvotes: 1
Views: 305
Reputation: 1477
This looks like a processor with turbo frequencies bound by temperature.
Modern processors are limited by their thermal design power (TDP). While the processor is cold, a single core may speed up to its turbo frequency multipliers. When it is hot, or when multiple cores are busy, the cores are slowed down to the guaranteed base speed. The difference between base and turbo speeds is often around 400 MHz. AVX or FMA3 workloads may slow the cores down even below the base speed.
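One way to check is to watch the per-core frequency while the benchmark runs. Here is a minimal sketch, assuming a Linux system that exposes the usual cpufreq sysfs files (the path may differ depending on the frequency driver):

/* Sketch: print the current frequency of each core.
 * Run it while the benchmark is active, e.g. in a loop or under watch. */
#include <stdio.h>

int main(void) {
    char path[128];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                 /* no more cores (or no cpufreq support) */
        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%-3d %.2f GHz\n", cpu, khz / 1e6);   /* sysfs reports kHz */
        fclose(f);
    }
    return 0;
}

Alternatively, running grep MHz /proc/cpuinfo repeatedly during the -np 1 and -np 20 runs should show the clock difference as well.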
Upvotes: 2