juvchan

Reputation: 6245

NVIDIA Triton vs TorchServe for SageMaker Inference

NVIDIA Triton vs TorchServe for SageMaker inference? When to recommend each?

Both are modern, production-grade inference servers. TorchServe is the default inference server for PyTorch models in the SageMaker Deep Learning Containers (DLCs). Triton is also supported for PyTorch inference on SageMaker.

Does anyone have a good comparison matrix for both?

Upvotes: 7

Views: 5805

Answers (3)

albertoperdomo2

Reputation: 462

Actually, Biano AI made a great comparison between some of the most common serving platforms for AI models. You can check it out here.

The comparison is well explained, but from experience, once you have adopted the NVIDIA Triton workflow and spent some time with the documentation, it is the best option for systems that require extreme fine-tuning, both at the model level and at the system level (throughput- and latency-wise).

It is highly reliable, but you have to be proficient at operating services like it, and everything related to monitoring, alerting, etc. is left on your side.

One of the key advantages of NVIDIA Triton is its support for different backends: not only can you write your own backend, but the built-in Python backend lets you do fairly custom things that are somewhat difficult with other platforms.
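
For illustration, here is a minimal sketch of what a Triton Python-backend model.py can look like; the tensor names (INPUT0/OUTPUT0) and the toy logic are assumptions, not from the answer, and must match the model's config.pbtxt.

```python
# model.py for a Triton Python-backend model. The tensor names are assumptions
# and must match the declarations in config.pbtxt.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the JSON-serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            data = in0.as_numpy()

            # Arbitrary custom pre/post-processing or business logic goes here.
            result = np.sqrt(np.abs(data)).astype(np.float32)

            out0 = pb_utils.Tensor("OUTPUT0", result)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        pass
```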

NVIDIA Triton's integration with MLflow for model deployment is also well thought out, and if you already use MLflow/Databricks to manage your models and experiments, it's the logical choice.
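
As a rough sketch of that workflow, assuming the NVIDIA mlflow-triton deployment plugin is installed and pointed at a running Triton model repository (the model name and registry URI below are hypothetical):

```python
# Hypothetical sketch: deploying an MLflow-registered model to Triton via the
# mlflow-triton deployment plugin. The plugin is typically configured through
# TRITON_URL / TRITON_MODEL_REPO environment variables.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("triton")

client.create_deployment(
    name="my_model",                 # becomes the Triton model name
    model_uri="models:/my_model/1",  # MLflow Model Registry URI
    flavor="triton",
)

# Later, roll out a new registry version or tear the deployment down:
# client.update_deployment(name="my_model", model_uri="models:/my_model/2", flavor="triton")
# client.delete_deployment(name="my_model")
```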

Finally, the NVIDIA Triton performance analyzer is great for running benchmarks and finding the best configuration for a given model, with very informative graphs and metrics.

One of the few downsides I have found working with Triton is that it does not allow multiple versions of the same model to be loaded on the GPU at the same time, which makes breaking changes in the model architecture hard to roll out to the apps that query that model.

Upvotes: 2

Because I don't have enough reputation to reply in comments, I'm writing this as an answer. MME stands for multi-model endpoints. MME enables sharing GPU instances behind an endpoint across multiple models and dynamically loads and unloads models based on the incoming traffic. You can read more about it at this link.
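
For context, here is a minimal sketch (names are hypothetical) of calling a SageMaker multi-model endpoint: one endpoint hosts many model artifacts in S3, and each request picks which one to use via TargetModel, with SageMaker loading/unloading artifacts on the instance as traffic demands.

```python
# Hypothetical sketch of invoking a SageMaker multi-model endpoint with boto3.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",  # an endpoint created in MultiModel mode
    TargetModel="model-a.tar.gz",            # which artifact to serve this request with
    ContentType="application/json",
    Body=b'{"inputs": [1.0, 2.0, 3.0]}',
)
print(response["Body"].read())
```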

Upvotes: -3

Raghu Ramesha

Reputation: 484

Important notes to add here on where the two serving stacks differ:

TorchServe does not provide the Instance Groups feature that Triton does (that is, stacking many copies of the same model, or even different models, onto the same GPU). This is a major advantage for both real-time and batch use cases, as the performance increase is almost proportional to the model replication count (i.e. 2 copies of the model get you almost twice the throughput and half the latency; check out a BERT benchmark of this here). It is hard to match a feature that is almost like having 2+ GPUs for the price of one.

If you are deploying PyTorch DL models, odds are you often want to accelerate them with GPUs. TensorRT (TRT) is a compiler developed by NVIDIA that automatically quantizes and optimizes your model graph, which represents another huge speed-up, depending on GPU architecture and model. It is, understandably, probably the best way to automatically optimize your model to run efficiently on GPUs and make good use of Tensor Cores. Triton has native integration to run TensorRT engines, as they're called (it can even convert your model to a TRT engine automatically via the config file), while TorchServe does not (even though you can use TRT engines with it).

There is more parity between the two when it comes to other important serving features: both have dynamic batching support, you can define inference DAGs with both (not sure whether the latter works with TorchServe on SageMaker without a big hassle), and both support custom code/handlers instead of only being able to serve a model's forward function.
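
To make the Instance Groups and dynamic batching points above concrete, here is a rough sketch of a Triton config.pbtxt; all values (model name, counts, batch size, queue delay) are illustrative and not taken from the benchmark mentioned above.

```
# config.pbtxt sketch: two copies of the same PyTorch model on one GPU,
# plus dynamic batching. Values are illustrative only.
name: "my_model"
platform: "pytorch_libtorch"
max_batch_size: 8

instance_group [
  {
    count: 2        # two replicas of the model on the same device
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 100
}
```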

Finally, MME on GPU (coming shortly) will be based on Triton, which is a valid argument for customers to get familiar with it so that they can quickly leverage this new feature for cost-optimization.

Bottom line, I think that Triton is just as easy (if not easier) to use, a lot more optimized/integrated for taking full advantage of the underlying hardware (and will keep being updated that way as newer GPU architectures are released, enabling an easy move to them), and in general blows TorchServe out of the water performance-wise when its optimization features are used in combination.

Upvotes: 8
