Reputation: 11
I've encountered an interesting phenomenon that I can't explain. I haven't found an answer online, as most posts deal with weak scaling and thus communication overhead.
Here is a small piece of code to illustrate the problem. It has been tested in different languages with similar results, hence the multiple tags.
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;
    MPI_Barrier(MPI_COMM_WORLD);
    t = clock();                    /* time only the empty nested loop */

    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            /* nothing */
        }
    }

    t = clock() - t;
    printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);

    MPI_Finalize();
    return 0;
}
Now, as you can see, the only part that is timed is the loop. Therefore, with identical cores, no hyperthreading, and sufficient RAM, increasing the number of processes should produce exactly the same time.
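For reference, clock() measures the CPU time of the process rather than elapsed time; a minimal sketch of the same program that records both (only the standard MPI_Wtime call is assumed beyond the code above) would look like this:

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    MPI_Barrier(MPI_COMM_WORLD);
    clock_t c0 = clock();          /* CPU time of this process */
    double  w0 = MPI_Wtime();      /* wall-clock (elapsed) time */

    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            /* nothing */
        }
    }

    printf("proc %d: cpu %f s, wall %f s\n", wrank,
           (double)(clock() - c0) / CLOCKS_PER_SEC,
           MPI_Wtime() - w0);

    MPI_Finalize();
    return 0;
}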
However, on my machine, which has 32 cores and 15 GiB of RAM,
mpirun -np 1 ./test
gives
proc 0 took 22.262777 seconds.
but
mpirun -np 20 ./test
gives
proc 18 took 24.440767 seconds.
proc 0 took 24.454365 seconds.
proc 4 took 24.461191 seconds.
proc 15 took 24.467632 seconds.
proc 14 took 24.469728 seconds.
proc 7 took 24.469809 seconds.
proc 5 took 24.461639 seconds.
proc 11 took 24.484224 seconds.
proc 9 took 24.491638 seconds.
proc 2 took 24.484953 seconds.
proc 17 took 24.490984 seconds.
proc 16 took 24.502146 seconds.
proc 3 took 24.513380 seconds.
proc 1 took 24.541555 seconds.
proc 8 took 24.539808 seconds.
proc 13 took 24.540005 seconds.
proc 12 took 24.556068 seconds.
proc 10 took 24.528328 seconds.
proc 19 took 24.585297 seconds.
proc 6 took 24.611254 seconds.
and, for other process counts, the times fall somewhere in between.
htop also shows an increase in RAM consumption (VIRT is ~100M for 1 core and ~300M for 20 cores), although that might just be related to the size of the MPI communicator?
Finally, the effect is definitely related to the size of the problem (and thus not communication overhead, which would cause a constant delay regardless of the size of the loop). Indeed, decreasing imax to, say, 10 000 makes the wall times similar:
1 core :
proc 0 took 0.028439 seconds.
20 cores :
proc 1 took 0.027880 seconds.
proc 12 took 0.027880 seconds.
proc 8 took 0.028024 seconds.
proc 16 took 0.028135 seconds.
proc 17 took 0.028094 seconds.
proc 19 took 0.028098 seconds.
proc 7 took 0.028265 seconds.
proc 9 took 0.028051 seconds.
proc 13 took 0.028259 seconds.
proc 18 took 0.028274 seconds.
proc 5 took 0.028087 seconds.
proc 6 took 0.028032 seconds.
proc 14 took 0.028385 seconds.
proc 15 took 0.028429 seconds.
proc 0 took 0.028379 seconds.
proc 2 took 0.028367 seconds.
proc 3 took 0.028291 seconds.
proc 4 took 0.028419 seconds.
proc 10 took 0.028419 seconds.
proc 11 took 0.028404 seconds.
This has been tried on several machines with similar results. Maybe we're missing something very simple.
Thanks for the help!
Upvotes: 1
Views: 305
Reputation: 1477
This looks like a processor with turbo frequencies bound by temperature.
Modern processors are limited by their thermal design power (TDP). While the processor is cold, a single core may speed up to its turbo frequency multipliers. When it is hot, or when multiple cores are busy, the cores are slowed down to the guaranteed base speed. The difference between base and turbo speeds is often around 400 MHz. AVX or FMA3 workloads may slow the cores down even below the base speed.
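One way to check is to watch the per-core frequency while the benchmark runs. Here is a minimal sketch, assuming a Linux system that exposes the usual cpufreq sysfs files (the path may differ depending on the frequency driver):

/* Sketch: print the current frequency of each core.
 * Run it while the benchmark is active, e.g. in a loop or under watch. */
#include <stdio.h>

int main(void) {
    char path[128];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                 /* no more cores (or no cpufreq support) */
        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%-3d %.2f GHz\n", cpu, khz / 1e6);   /* sysfs reports kHz */
        fclose(f);
    }
    return 0;
}

Alternatively, running grep MHz /proc/cpuinfo repeatedly during the -np 1 and -np 20 runs should show the clock difference as well.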
Upvotes: 2