Reputation: 31
I have created a Monte-Carlo simulation model implemented in TensorFlow 2.5. The model mostly consists of vector multiplications inside a tf.while_loop. I am benchmarking the performance on a Linux machine with 8 virtual CPUs. When I run the model in graph mode (without XLA optimization), it fully utilizes all 8 CPUs (I can see %CPU close to 800% using the top command). However, when I run the model after compiling with XLA (by passing jit_compile=True to the @tf.function decorator), the %CPU utilization stays around 250%. Is there a way to force TensorFlow to utilize all available CPU capacity with XLA?
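For context, the loop has roughly this structure (a simplified, made-up sketch; the real model and shapes are different):

import tensorflow as tf

@tf.function(jit_compile=True)  # set to False to benchmark plain graph mode
def simulate(weights, n_steps):
    # Toy stand-in for the Monte-Carlo loop: repeated vector
    # multiplications inside a tf.while_loop.
    state = tf.ones_like(weights)

    def cond(i, s):
        return i < n_steps

    def body(i, s):
        return i + 1, s * weights

    _, final_state = tf.while_loop(cond, body, [tf.constant(0), state])
    return tf.reduce_mean(final_state)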
I have experimented with changing the inter_op_parallelism and intra_op_parallelism thread settings. Setting both to 1 reduces the CPU utilization from 250% to 100%, but increasing them to 8 doesn't push the utilization beyond 250%.
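For reference, I set them through the standard tf.config.threading API (the value 8 matches the number of virtual CPUs on my machine):

import tensorflow as tf

# Must be called before TensorFlow creates its thread pools,
# i.e. before any op is executed.
tf.config.threading.set_inter_op_parallelism_threads(8)
tf.config.threading.set_intra_op_parallelism_threads(8)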
Any help or suggestions on what might be going on would be appreciated.
Upvotes: 1
Views: 500
Reputation: 1
I had the same question. Using the suggestions found at https://www.tensorflow.org/xla, I modified the JIT compile sequence for my ML model to something like this:

import os
import tensorflow as tf

# Ask XLA to dump its generated artifacts so they can be inspected.
os.environ['XLA_FLAGS'] = '--xla_dump_to=/tmp/dump'

@tf.function(jit_compile=True)
def foo(data):
    return model(data)
This produces an object (*.o) file in /tmp/dump, which I disassembled with objdump -d. Looking at the disassembly, it appears that the compiler has generated straight-line code for the model and its computational kernels rather than calling out to libraries that might support parallel execution. I don't see anything that suggests the possibility of parallel execution of this JIT-ted model, although like you I do observe parallel execution when I simply call the model.
However, for me the best performance for this particular model comes from using @tf.function() with jit_compile=False. In this case I observe intra-op parallelism happening, but no inter-op parallelism, which is also what I see when simply calling the model.
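For completeness, the variant that performs best for me is simply the same (hypothetical) foo as above, without XLA:

@tf.function(jit_compile=False)  # regular graph mode, no XLA
def foo(data):
    # With plain graph execution I see intra-op parallelism across the CPUs.
    return model(data)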
Upvotes: 0