I have implemented Conjugate Gradient in Fortran by replacing the linear algebra subroutines in the Wikipedia example with Intel MKL (Fortran) subroutines: DGEMV, DAXPY and DNRM2 only. It turns out that a=b is faster than DCOPY and a=2*a is faster than DSCAL.
The answers are correct and there is no problem with the implementation. However, when I compile it with ifort CG.f90 -mkl, the timings are:
MKL_SET_DYNAMIC = TRUE ; 140 seconds
MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=1 ; 70 seconds.
MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=2 ; ~100 seconds.
A few points:
- M16_LAY_GAS16, which after a lot of searching came down to multpd ASM. Nothing useful came out otherwise (or maybe I didn't know where to look). FWIW, I used VTune.
- KMP_AFFINITY maps one thread to one processor in the serial case and 2 threads to 2 processors in the parallel case.
My question is: why isn't MKL_DYNAMIC setting the number of threads to 1 if that is optimal? I don't need to use 2 threads if the same work is done in less time by 1.
Am I doing something wrong or is something wrong with Intel MKL?
Upvotes: 0
Views: 1264
Reputation: 72342
MKL_DYNAMIC is functionally the same as OMP_DYNAMIC/omp_set_dynamic() from the OpenMP standard.
It doesn't mean "magically change the number of threads to run the code as fast as possible". It means that the runtime can, under some circumstances, change the number of threads from the user-specified value or the system default, if there are system resource or other implementation-specific reasons to do so. Given that you haven't specified a number of threads and there are 4 concurrent hardware threads available, I would guess that your MKL_SET_DYNAMIC = TRUE case is using four threads.
If you ran something like MKL_SET_DYNAMIC=TRUE MKL_SET_NUM_THREADS=16 you might find that the runtime throttles the thread count down to 4 and the performance would be better than MKL_SET_DYNAMIC=FALSE MKL_SET_NUM_THREADS=16, because the runtime might detect you are asking for more than the number of available concurrent hardware threads. But that is all I would expect it to do.
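Since the runtime will not search for the fastest thread count on its own, one option is to pin it yourself and compare runs. Below is a minimal sketch using the environment variables MKL_DYNAMIC and MKL_NUM_THREADS, which are the environment counterparts of mkl_set_dynamic() and mkl_set_num_threads(). The helper name run_pinned and the binary path ./a.out (the default output of ifort CG.f90 -mkl) are assumptions made for illustration.

```shell
#!/bin/sh
# run_pinned (hypothetical helper): run a command with MKL's dynamic
# thread adjustment disabled and the thread count fixed to the first argument.
run_pinned() {
    threads=$1
    shift
    MKL_DYNAMIC=FALSE MKL_NUM_THREADS=$threads "$@"
}

# Assumption: ./a.out is the binary built from CG.f90.
BIN=./a.out

# Only attempt the runs if the binary is actually present.
if [ -x "$BIN" ]; then
    for n in 1 2 4; do
        echo "MKL_NUM_THREADS=$n"
        run_pinned "$n" "$BIN"
    done
fi
```

Wrapping each run in whatever timer you already use then shows directly which thread count is fastest for this problem size, rather than relying on MKL_DYNAMIC to pick it.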
Upvotes: 3