Aitar

Reputation: 23

"CUDA failure 700" when using onnxruntime backend optimizated with TensorRT in Triton

I want to deploy my ONNX model using Triton. Here is my model configuration, which works fine when a single GPU is specified.

name: "yolox"
platform: "onnxruntime_onnx"
max_batch_size: 2
dynamic_batching {
    max_queue_delay_microseconds: 100
}
instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0 ]
    }
]
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 800, 800, 3 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 25, 100, 100 ]
  },
  {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 25, 50, 50 ]
  },
  {
    name: "output2"
    data_type: TYPE_FP32
    dims: [ 25, 25, 25 ]
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP32" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
}}
model_warmup [
  {
    name: "random_input"
    batch_size: 1
    inputs: {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [800, 800, 3]
        random_data: true
      }
    }
  }
]

However, when I change the configuration to run on two GPUs,

instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0,1 ]
    }
]

I encounter the following error when starting the Triton server.

2024-07-24 01:58:32.501117150 [E:onnxruntime:log, tensorrt_execution_provider.h:82 log] [2024-07-24 01:58:32   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (an illegal memory access was encountered)
2024-07-24 01:58:32.501200757 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch-jit-export_9618487497305295762_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9618487497305295762_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-07-24 01:58:32.501266856 [E:onnxruntime:log, cuda_call.cc:118 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=bac6b01d263a ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=446 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_)); 
I0724 01:58:32.501728 148 onnxruntime.cc:3127] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
2024-07-24 01:58:32.513692921 [E:onnxruntime:log, tensorrt_execution_provider.h:82 log] [2024-07-24 01:58:32   ERROR] [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:432] Error 700 destroying stream '0x7fd820961730'.)
2024-07-24 01:58:32.520108393 [E:onnxruntime:log, tensorrt_execution_provider.h:82 log] [2024-07-24 01:58:32   ERROR] [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:432] Error 700 destroying stream '0x7fd816ae4370'.)

How can I fix this issue?

My environment is configured as follows:

I tried removing the warm-up phase, and the Triton server then starts up normally. However, as soon as I send a request to the server, the same error occurs and the server crashes.
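For reference, a minimal sketch of the kind of request that triggers the crash, using the Python tritonclient over HTTP (the exact client code isn't important; the URL, batch size, and random data here are just illustrative):

# Minimal sketch of a request to the "yolox" model, assuming the Python
# tritonclient HTTP API. URL and batch size are illustrative.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Random FP32 tensor shaped [batch, 800, 800, 3], matching the model config.
data = np.random.rand(1, 800, 800, 3).astype(np.float32)

infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

requested_outputs = [
    httpclient.InferRequestedOutput("output0"),
    httpclient.InferRequestedOutput("output1"),
    httpclient.InferRequestedOutput("output2"),
]

# The server logs the CUDA failure 700 error above as soon as this call runs.
result = client.infer("yolox", inputs=[infer_input], outputs=requested_outputs)
print(result.as_numpy("output0").shape)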

The reason I'm using TensorRT for optimization is that using ONNX Runtime directly as the backend leads to excessive GPU memory consumption, ultimately resulting in an out-of-memory error.

Upvotes: 0

Views: 202

Answers (0)
