Reputation: 23
We are facing an issue where loading a new version of a model in TensorFlow Serving takes a very long time during the warmup process (about 10 minutes in our case).
While investigating the issue, we found that the warmup process uses only 1 CPU core. We tried disabling warmup, but then the first request to the model also takes 10 minutes and still uses only 1 CPU core. Subsequent requests (with or without warmup) use all available CPU cores.
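For context, we provide warmup requests to TensorFlow Serving as a TFRecord file of PredictionLog protos under the model's assets.extra directory; a minimal sketch of how such a file is generated (the model name, signature name, and input tensor below are placeholders, not our real ones):

```python
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

# Write the warmup records next to the SavedModel:
# <model_base_path>/<version>/assets.extra/tf_serving_warmup_requests
with tf.io.TFRecordWriter("assets.extra/tf_serving_warmup_requests") as writer:
    request = predict_pb2.PredictRequest(
        model_spec=model_pb2.ModelSpec(name="my_model",
                                       signature_name="serving_default"),
        # Placeholder input name and shape.
        inputs={"input": tf.make_tensor_proto([[0.0] * 128], dtype=tf.float32)},
    )
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    writer.write(log.SerializeToString())
```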
We are aware of lazy loading, but our question is: how can we make the warmup (or the first request) use all available CPU cores so that loading a new model version does not take 10 minutes?
We have tried different TensorFlow Serving parameters, including tensorflow_intra_op_parallelism, tensorflow_inter_op_parallelism, and ModelWarmupOptions.num_model_warmup_threads, but none of them helped.
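Roughly, this is how we launch the server with those settings (paths and values are placeholders; the warmup-thread flag name is our assumption based on ModelWarmupOptions.num_model_warmup_threads and may not exist as a CLI flag in every version):

```bash
# tensorflow_intra_op_parallelism / tensorflow_inter_op_parallelism are the
# documented tensorflow_model_server flags; --num_model_warmup_threads is an
# assumption mirroring ModelWarmupOptions.num_model_warmup_threads.
tensorflow_model_server \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --rest_api_port=8501 \
  --tensorflow_intra_op_parallelism=16 \
  --tensorflow_inter_op_parallelism=16 \
  --num_model_warmup_threads=4
```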
We also observed that the issue might not be specific to TensorFlow Serving: when we load the model in a plain Python script and run the first request, it also uses only 1 CPU core and takes a long time.
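A minimal sketch of the kind of script we use to reproduce this outside of TensorFlow Serving (the model path, signature, input name, and input shape are placeholders):

```python
import time
import numpy as np
import tensorflow as tf

# Thread settings must be applied before any TF op runs.
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.threading.set_inter_op_parallelism_threads(16)

model = tf.saved_model.load("/models/my_model/1")           # placeholder path
infer = model.signatures["serving_default"]                 # placeholder signature
batch = tf.constant(np.zeros((1, 128), dtype=np.float32))   # placeholder input

for i in range(3):
    start = time.perf_counter()
    infer(input=batch)  # "input" is a placeholder input name
    print(f"request {i}: {time.perf_counter() - start:.1f}s")
# The first call is the slow, single-core one; subsequent calls use all cores.
```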
We tested this on versions 2.5, 2.15, and 2.17.
Upvotes: 0
Views: 35