Reputation: 23
I want to deploy my ONNX model with Triton Inference Server. Here is my model configuration, which works fine when a single GPU is specified:
name: "yolox"
platform: "onnxruntime_onnx"
max_batch_size: 2
dynamic_batching {
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 800, 800, 3 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 25, 100, 100 ]
  },
  {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 25, 50, 50 ]
  },
  {
    name: "output2"
    data_type: TYPE_FP32
    dims: [ 25, 25, 25 ]
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP32" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
model_warmup [
  {
    name: "random_input"
    batch_size: 1
    inputs: {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 800, 800, 3 ]
        random_data: true
      }
    }
  }
]
However, when I change the configuration to run on two GPUs,
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
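(In case the exact form matters: my understanding of the instance_group semantics is that `count` applies per listed GPU, so the group above should be equivalent to spelling the GPUs out in separate groups, roughly like this. I have not verified that the split form behaves any differently.)
# Sketch of the same deployment written as one group per GPU
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]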
I encounter the following errors when starting the Triton server:
2024-07-24 01:58:32.501117150 [E:onnxruntime:log, tensorrt_execution_provider.h:82 log] [2024-07-24 01:58:32 ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (an illegal memory access was encountered)
2024-07-24 01:58:32.501200757 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch-jit-export_9618487497305295762_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9618487497305295762_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-07-24 01:58:32.501266856 [E:onnxruntime:log, cuda_call.cc:118 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=bac6b01d263a ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=446 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
I0724 01:58:32.501728 148 onnxruntime.cc:3127] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
2024-07-24 01:58:32.513692921 [E:onnxruntime:log, tensorrt_execution_provider.h:82 log] [2024-07-24 01:58:32 ERROR] [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:432] Error 700 destroying stream '0x7fd820961730'.)
2024-07-24 01:58:32.520108393 [E:onnxruntime:log, tensorrt_execution_provider.h:82 log] [2024-07-24 01:58:32 ERROR] [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:432] Error 700 destroying stream '0x7fd816ae4370'.)
How can I fix this issue?
My environment is configured as follows:
ONNX v6, opset 11
CUDA 12.5
Triton 24.06
Driver Version: 535.54.03
If I remove the model_warmup section, the Triton server starts up normally; however, as soon as I send a request to the server, the same error occurs and the server crashes.
I am using the TensorRT execution accelerator because running the model directly on the ONNX Runtime backend consumes excessive GPU memory and eventually fails with an out-of-memory error.
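For reference, by "directly on the ONNX Runtime backend" I mean the same configuration with the optimization block removed, roughly:
# Sketch of the accelerator-free variant: same model, inputs, outputs,
# instance_group and model_warmup as above, but no
# optimization { execution_accelerators } block. This version starts and
# serves requests, but GPU memory usage grows until the server runs out of memory.
name: "yolox"
platform: "onnxruntime_onnx"
max_batch_size: 2
dynamic_batching {
  max_queue_delay_microseconds: 100
}
# ... input, output, instance_group and model_warmup unchanged ...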
Upvotes: 0
Views: 202