Reputation: 881
Running wide-and-deep (linear + DNN) model inference with the plain vanilla TF 1.11 that comes prebuilt with the Deep Learning Images shows much better performance on version M9 (for GPU) than on version M10 (for CPU inference):
M9: tf-latest-cu92
M10: tf-latest-cpu
In both images the TF version is 1.11, and both come prebuilt with Intel MKL-optimized binaries. I turned on verbose logging for MKL instructions, and on the M10 image I see a lot of MKL-related settings:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
as well as logging of MKL instructions with timings. On the M9 image I don't observe any of this, even though both images report the same version info:
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x5622b7736500,1,0x5622b7736500,1) 2.54ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
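For reference, a minimal sketch of how the verbose logging was enabled (MKL_VERBOSE=1 is Intel MKL's own switch; it has to be set before the library is loaded):

import os

# Intel MKL prints one line per BLAS call, with timings, when MKL_VERBOSE=1.
# It must be set before TensorFlow (and hence MKL) is loaded.
os.environ["MKL_VERBOSE"] = "1"

import tensorflow as tf
print(tf.__version__)  # prints 1.11.0 on both images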
I am seeing 2-4x worse performance when the Intel MKL instructions are being used (M10) as opposed to the M9 image. Note: even though the M9 image is for GPU, I turned off CUDA device visibility and am benchmarking only CPU inference, as sketched below. The same observation holds on another Linux box with a pip install of TF 1.11 in a clean virtualenv.
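A minimal sketch of the CPU-only benchmark setup; the large matmul here is a stand-in for the real wide-and-deep model, which is not shown:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide the GPU on the M9 image

import time
import numpy as np
import tensorflow as tf

# Stand-in workload: a large matmul, which MKL dispatches to SGEMM.
x = tf.placeholder(tf.float32, shape=(1024, 1024))
y = tf.matmul(x, x)
data = np.random.rand(1024, 1024).astype(np.float32)

with tf.Session() as sess:
    sess.run(y, feed_dict={x: data})  # warm-up
    start = time.time()
    for _ in range(100):
        sess.run(y, feed_dict={x: data})
    print("avg ms per run:", (time.time() - start) * 1000 / 100)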
Any insights on how to debug this, or how to get the maximum out of the Intel MKL library?
Upvotes: 1
Views: 626
Reputation: 918
This behavior has been fixed in M16+ (which ships TF 1.12).
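For images older than M16, one workaround worth trying is to tune the MKL/OpenMP threading knobs by hand rather than relying on the image defaults. A minimal sketch; the thread counts are illustrative and should start from the machine's physical core count:

import os

# The same knobs the M10 image sets, tuned per machine. Setting
# OMP_NUM_THREADS above the physical core count oversubscribes the
# cores, which is a common cause of MKL slowdowns.
os.environ["OMP_NUM_THREADS"] = "16"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# Match TF's own thread pools to the same core count.
config = tf.ConfigProto(
    intra_op_parallelism_threads=16,  # threads within a single op
    inter_op_parallelism_threads=2,   # independent ops run concurrently
)
sess = tf.Session(config=config)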
Upvotes: 0