Reputation: 406
I currently have a Triton server with a Python backend that serves a model. Inference runs on a g4dn.xlarge instance. The GPU instance count in config.pbtxt is varied between 1 and 3.
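For reference, the instance_group section of my config.pbtxt looks roughly like this (other fields such as the model name and inputs/outputs are omitted):

instance_group [
  {
    count: 3        # varied between 1 and 3
    kind: KIND_GPU
  }
]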
I am using perf_analyzer to see whether my model scales well with concurrent requests, but I get the following results when instance_count is set to 3:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 72.2694 infer/sec, latency 13838 usec
Concurrency: 2, throughput: 85.3758 infer/sec, latency 23419 usec
Concurrency: 3, throughput: 91.5349 infer/sec, latency 32754 usec
From the above results, it can be seen that the average throughput does increase when the instance count is set to 3, but it does not scale linearly. Using nvidia-smi, I verified that the GPU is not fully utilized (utilization goes up to about 70%).
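In case it matters, this is roughly the perf_analyzer invocation I used to produce the numbers above (the model name is a placeholder), sweeping concurrency from 1 to 3:

perf_analyzer -m my_model --concurrency-range 1:3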
I have a couple of questions:
Is it possible to load the model into GPU memory once and run inference from multiple threads/processes that share the single model copy?
Does Triton have the capability to automatically batch requests that arrive at roughly the same time from multiple individual clients?
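For the second question, I came across the dynamic_batching setting in the model configuration reference, but I am not sure whether it is what I need or how it interacts with instance_group. Something along these lines (the batch sizes and delay are just guesses):

max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}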
Upvotes: 2
Views: 663