Ajayv

Reputation: 406

Serve concurrent requests with NVIDIA Triton on a GPU

I currently have a Triton server with a Python backend that serves a model, and I run inference on an AWS g4dn.xlarge machine. The instance count provided for the GPU in config.pbtxt is varied between 1 and 3.
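For reference, the relevant part of my config.pbtxt looks roughly like the sketch below (the model name, tensor names, and dimensions are placeholders, not my actual values):

    name: "my_model"            # placeholder model name
    backend: "python"
    max_batch_size: 8
    input [
      { name: "INPUT0", data_type: TYPE_FP32, dims: [ -1 ] }
    ]
    output [
      { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ -1 ] }
    ]
    instance_group [
      # count is the value I vary between 1 and 3
      { count: 3, kind: KIND_GPU, gpus: [ 0 ] }
    ]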

I am using perf_analyzer to see whether my model scales well for concurrent requests, but I get the following results when instance_count is set to 3:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 72.2694 infer/sec, latency 13838 usec
Concurrency: 2, throughput: 85.3758 infer/sec, latency 23419 usec
Concurrency: 3, throughput: 91.5349 infer/sec, latency 32754 usec
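These numbers come from a perf_analyzer invocation along the following lines (the model name and endpoint are placeholders for my actual setup):

    perf_analyzer -m my_model -u localhost:8001 -i grpc --concurrency-range 1:3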

From the above results, it can be seen that the average throughput does increase when the instance count is set to 3, but it does not scale linearly. Using nvidia-smi, I was able to verify that the GPU is not fully utilized (utilization goes up to about 70%).

I have a couple of questions:

  1. Is it possible to load the model into GPU memory once and run inference on multiple threads/processes that share the single model copy?

  2. Does Triton have the capability to automatically batch requests that arrive at roughly the same time from multiple independent clients (see the config sketch after this list)?
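For question 2, I believe this is what the dynamic_batching block in config.pbtxt is meant to control; a minimal sketch of what I think it would look like (the batch sizes and queue delay are example values I picked, not tuned settings):

    dynamic_batching {
      # batch sizes the scheduler should prefer to form
      preferred_batch_size: [ 4, 8 ]
      # how long to wait for more requests before dispatching a batch
      max_queue_delay_microseconds: 100
    }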

Upvotes: 2

Views: 663

Answers (0)
