Reputation: 1608
I serve multiple models in one process, and each model creates its own TensorFlow session. Say there are 8 models, so 8 tf.Session objects are created.
I followed "Optimizing for CPU" (Tuning_mkl_for_the_best_performance) to enable MKL. My machine has 8 cores and 2 threads. I configure each tf.Session as below.
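import tensorflow as tf  # TensorFlow 1.x API
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 8  # threads used inside a single op
config.inter_op_parallelism_threads = 1  # ops that may run concurrently
sess = tf.Session(config=config)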
I also set these environment variables:
OMP_NUM_THREADS=8
KMP_BLOCKTIME=1
KMP_AFFINITY='granularity=fine,verbose,compact,1,0'
KMP_SETTINGS=1
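For a process started from Python, one way to apply these variables is to put them into os.environ before TensorFlow is imported, since the OpenMP runtime reads them when it initializes. This is only a minimal sketch under that assumption; the helper name set_mkl_env is made up for illustration and is not part of TensorFlow or MKL.

```python
import os


def set_mkl_env(num_threads):
    """Export the OpenMP/MKL variables listed above into this process."""
    settings = {
        "OMP_NUM_THREADS": str(num_threads),
        "KMP_BLOCKTIME": "1",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
        "KMP_SETTINGS": "1",
    }
    os.environ.update(settings)
    return settings


# Assumption: this runs before `import tensorflow`, so the OpenMP runtime
# has not been initialized yet and still picks the values up.
set_mkl_env(8)
```

A process that is not started from Python would need the same variables exported in its environment before launch.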
However, this may cause CPU oversubscription: the Golang process creates 87 threads. Is something wrong with my settings?
Here is the log from OMP.
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 2,4,5,12,17,27,32,47
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 32 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 5
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 5
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 47 maps to package 1 core 11
OMP: Info #250: KMP_AFFINITY: pid 832 tid 946 thread 0 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 832 tid 946 thread 1 bound to OS proc set 5
OMP: Info #250: KMP_AFFINITY: pid 832 tid 945 thread 2 bound to OS proc set 32
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1583 thread 3 bound to OS proc set 27
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1584 thread 4 bound to OS proc set 2
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1585 thread 5 bound to OS proc set 17
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1586 thread 6 bound to OS proc set 12
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1587 thread 7 bound to OS proc set 47
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1590 thread 10 bound to OS proc set 32
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1589 thread 9 bound to OS proc set 5
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1591 thread 11 bound to OS proc set 27
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1588 thread 8 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 832 tid 2120 thread 13 bound to OS proc set 17
OMP: Info #250: KMP_AFFINITY: pid 832 tid 2122 thread 15 bound to OS proc set 47
OMP: Info #250: KMP_AFFINITY: pid 832 tid 2123 thread 16 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 832 tid 2121 thread 14 bound to OS proc set 12
OMP: Info #250: KMP_AFFINITY: pid 832 tid 2119 thread 12 bound to OS proc set 2
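To see how many OpenMP threads end up sharing each OS proc, the "bound to OS proc set" lines can be grouped by proc. The sketch below is only an illustration: the regular expression is my guess at a pattern that fits the lines shown here, and the sample is four lines copied from the log above.

```python
import re
from collections import defaultdict

# Matches the "bound to OS proc set" lines of the KMP_AFFINITY verbose output.
BOUND_RE = re.compile(r"tid (\d+) thread (\d+) bound to OS proc set (\d+)")


def threads_per_proc(log_text):
    """Map each OS proc id to the OpenMP thread numbers bound to it."""
    bound = defaultdict(list)
    for match in BOUND_RE.finditer(log_text):
        _tid, thread, proc = match.groups()
        bound[int(proc)].append(int(thread))
    return dict(bound)


# Four lines copied from the log above.
sample = """\
OMP: Info #250: KMP_AFFINITY: pid 832 tid 946 thread 0 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 832 tid 946 thread 1 bound to OS proc set 5
OMP: Info #250: KMP_AFFINITY: pid 832 tid 1588 thread 8 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 832 tid 2123 thread 16 bound to OS proc set 4
"""

for proc, threads in sorted(threads_per_proc(sample).items()):
    print(f"OS proc {proc}: {len(threads)} thread(s) {threads}")
```

Any proc that reports more than one thread is shared by several OpenMP threads. In the full log, thread numbers 0 through 16 are spread over only 8 OS procs.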
Upvotes: 2
Views: 715
Reputation: 648
From the available information, it looks like you have 16 (8x2) physical cores and 32 (8x2x2) logical cores. The recommended setting for 'intra_op_parallelism_threads' is the number of physical cores, and for 'inter_op_parallelism_threads' it is the number of sockets.
In your case, assuming 8 models are served at a time, I would suggest trying the following configuration.
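import tensorflow as tf  # TensorFlow 1.x API
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 2  # threads used inside a single op
config.inter_op_parallelism_threads = 2  # ops that may run concurrently
sess = tf.Session(config=config)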
and
OMP_NUM_THREADS=2
KMP_BLOCKTIME=1
KMP_AFFINITY='granularity=fine,verbose,compact,1,0'
KMP_SETTINGS=1
Also try 'config.intra_op_parallelism_threads = 1' together with 'OMP_NUM_THREADS=1'.
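The arithmetic behind these values, as I read it, is to divide the physical cores among the models so that the compute threads requested by all sessions together do not exceed the core count. A rough sketch of that split follows; the function names are hypothetical, and it counts only the threads requested per session, not every thread the runtime may spawn.

```python
def threads_per_model(physical_cores, num_models):
    """Evenly split physical cores across models, at least one thread each."""
    if physical_cores < 1 or num_models < 1:
        raise ValueError("physical_cores and num_models must be positive")
    return max(1, physical_cores // num_models)


def requested_threads(per_model, num_models):
    """Total intra-op/OpenMP threads requested across all sessions."""
    return per_model * num_models


# 16 physical cores is the estimate used in this answer; 8 is the number
# of processors the OpenMP log in the question reports as available.
for cores in (16, 8):
    per_model = threads_per_model(cores, 8)
    total = requested_threads(per_model, 8)
    print(f"{cores} cores, 8 models: {per_model} per model, {total} in total")
```

By the same count, the original setting of 8 threads in each of 8 sessions requests 64 threads, which is far more than either core estimate.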
For more details, refer to https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference
Hope this helps.
Upvotes: 2