haju

Reputation: 241

How to host/invoke multiple models in NVIDIA Triton Server for inference?

Based on the documentation here, https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/nlp/realtime/triton/multi-model/bert_trition-backend/bert_pytorch_trt_backend_MME.ipynb, I have set up a multi-model endpoint using a GPU instance type and the NVIDIA Triton container. Looking at the setup in the link, the model is invoked by passing tokens rather than text. Is it possible to pass text directly to the model, given that the input type is set to the string data type in config.pbtxt (sample code below)? I'm looking for any examples around this.

config.pbtxt

name: "..."
platform: "..."
max_batch_size: 0
input [
  {
    name: "INPUT_0"
    data_type: TYPE_STRING
    ...
  }
]
output [
  {
    name: "OUTPUT_1"
    ...
  }
]

Multi-model invocation

import json

import boto3

# SageMaker runtime client used to invoke the multi-model endpoint
client = boto3.client("sagemaker-runtime")

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
# tokenize_text is a helper from the linked notebook that returns
# padded token ids and an attention mask of length 128
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

# TargetModel selects one of the archives hosted on the multi-model endpoint
# (i indexes the packaged models: bert-0.tar.gz, bert-1.tar.gz, ...)
response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel=f"bert-{i}.tar.gz",
)
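
Ideally I would like to skip the client-side tokenization and send the text itself. Something like the sketch below is what I have in mind (my assumption, based on Triton's HTTP/JSON inference protocol, where TYPE_STRING tensors use the BYTES datatype and strings can be placed directly in the data list):

# Hypothetical raw-text payload; "INPUT_0" and the shape follow the
# config.pbtxt above, and "BYTES" is the JSON datatype for TYPE_STRING
payload = {
    "inputs": [
        {
            "name": "INPUT_0",
            "shape": [1],
            "datatype": "BYTES",
            "data": [text_triton],
        }
    ]
}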

Upvotes: 1

Views: 1413

Answers (1)

Marc Karp

Reputation: 1314

If you want to pass text directly, you can use an ensemble model in Triton, where the first model tokenizes the text and passes the tokens on to the BERT model.
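
A minimal sketch of what the ensemble's config.pbtxt could look like, assuming a preprocessing model named tokenizer and a BERT model named bert whose input names match the token_ids/attn_mask names in your payload (the other tensor names and the output dims are placeholders):

name: "ensemble_bert"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT_1"
    data_type: TYPE_FP32
    dims: [ 1, 128, 768 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "TEXT" value: "TEXT" }
      output_map { key: "INPUT_IDS" value: "input_ids" }
      output_map { key: "ATTENTION_MASK" value: "attention_mask" }
    },
    {
      model_name: "bert"
      model_version: -1
      input_map { key: "token_ids" value: "input_ids" }
      input_map { key: "attn_mask" value: "attention_mask" }
      output_map { key: "OUTPUT_1" value: "OUTPUT_1" }
    }
  ]
}

In each step, the map key is the composing model's own tensor name and the value is the ensemble-internal tensor name, so the tokenizer's outputs are wired into the BERT model's token_ids and attn_mask inputs.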

Take a look at this link that describes the strategy: https://blog.ml6.eu/triton-ensemble-model-for-deploying-transformers-into-production-c0f727c012e3
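
The tokenizer itself can be a small Python backend model. A sketch of its model.py, assuming a Hugging Face tokenizer and the tensor names used in the ensemble config above:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # "bert-base-uncased" is a placeholder; use the tokenizer that
        # matches the deployed BERT model
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING inputs arrive as numpy object arrays of bytes
            texts = [
                t.decode("utf-8")
                for t in pb_utils.get_input_tensor_by_name(request, "TEXT")
                .as_numpy()
                .reshape(-1)
            ]
            enc = self.tokenizer(
                texts,
                padding="max_length",
                truncation=True,
                max_length=128,
                return_tensors="np",
            )
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[
                        pb_utils.Tensor("INPUT_IDS", enc["input_ids"].astype(np.int32)),
                        pb_utils.Tensor("ATTENTION_MASK", enc["attention_mask"].astype(np.int32)),
                    ]
                )
            )
        return responses

With this in place, the client sends the raw string to the ensemble model and gets the BERT output back, with tokenization handled server-side.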

Upvotes: 2
