Reputation: 406
I currently have a Triton server with a Python backend that serves a model. Inference runs on a g4dn.xlarge instance. The GPU instance count in config.pbtxt is varied between 1 and 3.
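For reference, the instance_group section of my config.pbtxt looks roughly like this (other fields such as the model name and inputs/outputs are omitted):

instance_group [
  {
    count: 3        # varied between 1 and 3
    kind: KIND_GPU
  }
]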
I am using perf_analyzer to see whether my model scales well with concurrent requests, but I get the following results when instance_count is set to 3:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 72.2694 infer/sec, latency 13838 usec
Concurrency: 2, throughput: 85.3758 infer/sec, latency 23419 usec
Concurrency: 3, throughput: 91.5349 infer/sec, latency 32754 usec
From the above results, it can be seen that the average throughput does increase when the instance count is set to 3, but it does not scale linearly. Using nvidia-smi, I verified that the GPU is not fully utilized (utilization goes up to about 70%).
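In case it matters, this is roughly the perf_analyzer invocation I used to produce the numbers above (the model name is a placeholder), sweeping concurrency from 1 to 3:

perf_analyzer -m my_model --concurrency-range 1:3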
I have a couple of questions:
Is it possible to load the model into GPU memory once and run inference from multiple threads/processes that share the single model copy?
Does Triton have the capability to automatically batch requests that arrive at roughly the same time from multiple individual clients?
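For the second question, I came across the dynamic_batching setting in the model configuration reference, but I am not sure whether it is what I need or how it interacts with instance_group. Something along these lines (the batch sizes and delay are just guesses):

max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}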
Upvotes: 2
Views: 663