Bar-Levav

Reputation: 181

Properly configuring use_io_binding in ONNX ORTModelForSequenceClassification to improve inference speed on GPU

I'm running into an issue where applying the changes shown in green in the diff below leads to worse performance on GPU (an additional ~100ms/request on average, with significantly higher p(90) and above), and I'm trying to understand why this might be the case.

Below is a diff comparing the approach I thought would be better (in green, the additions) against the current implementation (in red, the removals).

The diff shows the inference pipeline for an ONNX model (a fine-tuned DeBERTa model) that I'm trying to run in production on a GPU.

@@ -53,14 +53,13 @@ class Scanner():
         """
         super().__init__(config)
         self.tokenizer = AutoTokenizer.from_pretrained(self.ONNX_PATH)
+        self.device = get_device()
         self.model = ORTModelForSequenceClassification.from_pretrained(
             self.ONNX_PATH,
             local_files_only=True,
+            provider="CUDAExecutionProvider" if self.device.type == "cuda" else "CPUExecutionProvider",
+            use_io_binding=self.device.type == "cuda",  # This is already occurring by default, but better to be explicit
         )
-        self.device = get_device()
-        self.model = self.model.to(self.device)
-        # Disable IO binding since we're using numpy inputs
-        self.model.use_io_binding = False
 
     async def analyze(self, content, **kwargs) -> AnalysisResult:
         """
@@ -95,24 +94,37 @@ class Scanner():
        loop = asyncio.get_event_loop()

        def run_inference():
             By running inference in this manner, we optimize throughput while effectively
             managing resource usage through a global scanning queue. This approach is particularly
             beneficial for applications requiring high responsiveness and scalability.
+
+            INFERENCE NOTE:
+            This inference follows the same inference instructions as set out by the prompt guard tutorial:
+            https://github.com/meta-llama/llama-cookbook/blob/main/end-to-end-use-cases/responsible_ai/prompt_guard/prompt_guard_tutorial.ipynb
+
+            One thing to note is that the model here is an ONNX model, rather than the PyTorch model used in the tutorial.
+            As a result, there are some small alterations from the prompt_guard_tutorial.
+            For example, we have made explicit the use of IO binding when running on GPU (see init) in order to improve
+            performance.
             """
+            # Temperature scaling is about calibrating the confidence of an already trained model.
+            # It adjusts how "sharp" or "soft" the probability distributions are in the model's outputs.
+            # A temperature >1 makes the distribution more uniform (less confident), while <1 makes it more peaked (more confident).
+            # Adding temperature scaling to reflect prompt guard tutorial, however keeping it set to 1.0 (no impact) for now.
+            temperature = 1.0
             # Tokenize input and prepare for model
-            inputs = self.tokenizer(content, return_tensors="pt", truncation=True, max_length=512)
-            # Move inputs to same device as model for consistent processing
-            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+            # We're not padding here since we're using a batch size of 1; one scan per prompt (for now).
+            # We should revisit this if we start running batch scans.
+            inputs = self.tokenizer(content, return_tensors="pt", padding=False, truncation=True, max_length=512)
 
-            # Convert inputs to numpy arrays for ONNX Runtime
-            # Ensure tensors are moved to CPU (v.cpu()) before converting to Numpy arrays
-            # ONNX Runtime requires inputs in Numpy format
-            inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
-            # Run inference
-            outputs = self.model(**inputs)
+            # Get logits from the model, make sure to disable gradients since we don't need to backpropagate
+            with torch.no_grad():
+                logits = self.model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]).logits
 
+            # Apply temperature scaling (no impact for now)
+            scaled_logits = logits / temperature
             # Process the outputs
-            logits = outputs.logits
-            prediction = torch.nn.functional.softmax(torch.from_numpy(logits), dim=1)
+            prediction = torch.nn.functional.softmax(scaled_logits, dim=1)
+            probabilities = prediction[0]
             predicted_class = prediction.argmax().item()
-            confidence = prediction[0][predicted_class].item()
+            confidence = probabilities[predicted_class].item()
 
             return predicted_class, confidence
        
        # Run compute-intensive operations in thread pool
        predicted_class, confidence = await loop.run_in_executor(None, run_inference)

        findings = []
        if LABEL_MAPPING[str(predicted_class)] == "TESTER" and confidence > 0.8:
            findings.append(self._create_finding(content))

In the diff I've included the inference method. This model is just one of a number of scanners that we run from Python coroutines, which is why the compute-heavy inference is pushed into the thread pool via run_in_executor.
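For context, the surrounding pattern looks roughly like the sketch below (names like scanner.run_inference are simplified/hypothetical): each scanner coroutine hands its blocking inference call to the default thread pool so the event loop stays responsive.

import asyncio

async def analyze(scanner, content):
    # scanner.run_inference stands in for the blocking, compute-heavy call in the diff
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, scanner.run_inference, content)

async def scan_all(scanners, content):
    # all scanners run concurrently; only the inference itself occupies worker threads
    return await asyncio.gather(*(analyze(s, content) for s in scanners))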

My first inclination was that the problem might be setting padding=False. I double-checked, though -- the ONNX model doesn't expect fixed input sizes, so I don't think that's the case.
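For what it's worth, here's a quick way to confirm the dynamic axes (a sketch; the model path is a placeholder) -- dynamic dimensions show up as symbolic names or None in the session's input metadata:

import onnxruntime as ort

sess = ort.InferenceSession("path/to/model.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    # dynamic dims print as e.g. ['batch_size', 'sequence_length'] instead of fixed integers
    print(inp.name, inp.shape)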

My current hypothesis (and the major change in the diff) is that the problem lies with how I'm using IO binding -- I thought it would improve performance, but perhaps I've configured it poorly?
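To make that hypothesis concrete, the sketch below is roughly what I expected the IO-binding path to look like (the path and prompt are placeholders, and I'm not fully certain whether the inputs have to be moved to the GPU manually): keep the inputs as torch tensors on the same device as the session, so the binding can work on device memory instead of round-tripping through numpy.

import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

ONNX_PATH = "path/to/onnx-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(ONNX_PATH)
model = ORTModelForSequenceClassification.from_pretrained(
    ONNX_PATH,
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)

inputs = tokenizer("some prompt", return_tensors="pt", truncation=True, max_length=512)
# Keep inputs as torch tensors on the GPU rather than numpy arrays on the CPU,
# so the binding can use device memory directly (my assumption about the intended usage).
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # should come back as a torch tensor when IO binding is active

probs = torch.nn.functional.softmax(logits, dim=-1)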

Any guidance to reason better about this would be much appreciated! Thanks!!

Upvotes: 0

Views: 32

Answers (0)
