I'm running into an issue where the changes (in green) in the following diff lead to worse performance on GPU (an additional ~100 ms/request on average, with significantly higher p(90) and above), and I'm trying to understand why that might be the case.
Below is a diff comparing the approach I thought would be better (in green, the additions) against the current implementation (in red, the removals).
The diff contains the inference pipeline for an ONNX model (a fine-tuned DeBERTa model) that I'm trying to run in production on a GPU.
@@ -53,14 +53,13 @@ class Scanner():
"""
super().__init__(config)
self.tokenizer = AutoTokenizer.from_pretrained(self.ONNX_PATH)
+ self.device = get_device()
self.model = ORTModelForSequenceClassification.from_pretrained(
self.ONNX_PATH,
local_files_only=True,
+ provider="CUDAExecutionProvider" if self.device.type == "cuda" else "CPUExecutionProvider",
+ use_io_binding=self.device.type == "cuda", # This is already occurring by default, but better to be explicit
)
- self.device = get_device()
- self.model = self.model.to(self.device)
- # Disable IO binding since we're using numpy inputs
- self.model.use_io_binding = False
async def analyze(self, content, **kwargs) -> AnalysisResult:
"""
@@ -95,24 +94,37 @@ class Scanner():
loop = asyncio.get_event_loop()
def run_inference():
By running inference in this manner, we optimize throughput while effectively
managing resource usage through a global scanning queue. This approach is particularly
beneficial for applications requiring high responsiveness and scalability.
+
+ INFERENCE NOTE:
+ This inference follows the same inference instructions as set out by the prompt guard tutorial:
+ https://github.com/meta-llama/llama-cookbook/blob/main/end-to-end-use-cases/responsible_ai/prompt_guard/prompt_guard_tutorial.ipynb
+
+ One thing to note is that the model here is an ONNX model, rather than the PyTorch model used in the tutorial.
+        As a result, there are some small alterations from the prompt_guard_tutorial.
+ For example, we have made explicit the use of IO binding when running on GPU (see init) in order to improve
+ performance.
"""
+ # Temperature scaling is about calibrating the confidence of an already trained model.
+ # It adjusts how "sharp" or "soft" the probability distributions are in the model's outputs.
+ # A temperature >1 makes the distribution more uniform (less confident), while <1 makes it more peaked (more confident).
+ # Adding temperature scaling to reflect prompt guard tutorial, however keeping it set to 1.0 (no impact) for now.
+ temperature = 1.0
# Tokenize input and prepare for model
- inputs = self.tokenizer(content, return_tensors="pt", truncation=True, max_length=512)
- # Move inputs to same device as model for consistent processing
- inputs = {k: v.to(self.device) for k, v in inputs.items()}
+ # We're not padding here since we're using a batch size of 1; one scan per prompt (for now).
+ # We should revisit this if we start running batch scans.
+ inputs = self.tokenizer(content, return_tensors="pt", padding=False, truncation=True, max_length=512)
- # Convert inputs to numpy arrays for ONNX Runtime
- # Ensure tensors are moved to CPU (v.cpu()) before converting to Numpy arrays
- # ONNX Runtime requires inputs in Numpy format
- inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
- # Run inference
- outputs = self.model(**inputs)
+ # Get logits from the model, make sure to disable gradients since we don't need to backpropagate
+ with torch.no_grad():
+ logits = self.model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]).logits
+ # Apply temperature scaling (no impact for now)
+ scaled_logits = logits / temperature
# Process the outputs
- logits = outputs.logits
- prediction = torch.nn.functional.softmax(torch.from_numpy(logits), dim=1)
+ prediction = torch.nn.functional.softmax(scaled_logits, dim=1)
+ probabilities = prediction[0]
predicted_class = prediction.argmax().item()
- confidence = prediction[0][predicted_class].item()
+ confidence = probabilities[predicted_class].item()
return predicted_class, confidence
# Run compute-intensive operations in thread pool
predicted_class, confidence = await loop.run_in_executor(None, run_inference)
findings = []
if LABEL_MAPPING[str(predicted_class)] == "TESTER" and confidence > 0.8:
findings.append(self._create_finding(content))
In the diff, I include the inference method call. This model is just one of a number of scanners that we run as Python coroutines (hence the run_in_executor call, to offload the compute-heavy work to the thread pool).
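To rule out contention in that shared thread pool (several scanners competing for the same executor), I've been instrumenting the call with something like the sketch below. This is my own timing harness, not part of the production code:

```python
# Rough sketch (my own instrumentation, not production code): time the executor queue wait
# separately from the model call, to see whether the extra latency comes from GPU inference
# itself or from contention in the shared thread pool.
import asyncio
import time

async def timed_run(run_inference):
    loop = asyncio.get_event_loop()
    queued_at = time.perf_counter()

    def wrapped():
        started_at = time.perf_counter()
        result = run_inference()
        return result, started_at, time.perf_counter()

    result, started_at, finished_at = await loop.run_in_executor(None, wrapped)
    print(f"queue wait: {(started_at - queued_at) * 1000:.1f} ms, "
          f"inference: {(finished_at - started_at) * 1000:.1f} ms")
    return result
```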
My first inclination was that this might be an issue with setting padding=False. I double-checked: the ONNX model doesn't expect fixed input sizes, so I don't think that's the cause.
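For completeness, this is roughly how I verified that the exported model accepts dynamic sequence lengths (the model path is a placeholder):

```python
# Quick check (path is a placeholder): inspect the ONNX graph's declared input shapes.
# Dynamic axes show up as symbolic names (e.g. 'batch_size', 'sequence_length') rather than fixed ints.
import onnxruntime as ort

session = ort.InferenceSession("path/to/model.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape)
```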
My current hypothesis (and the major change in the diff) is that there is an issue with how I'm using IO binding. I thought it would improve performance, but perhaps I've configured it poorly?
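To test that hypothesis in isolation, I'm putting together a standalone micro-benchmark along these lines. The path is a placeholder, the harness is my own, and it reflects my understanding of the optimum API rather than anything from the tutorial (the diff's old path selected the provider via .to(device), which I believe is equivalent to passing it explicitly as below):

```python
# Standalone micro-benchmark sketch: compare the two configurations from the diff outside the
# async/thread-pool machinery. Assumes optimum + onnxruntime-gpu are installed; ONNX_PATH is a placeholder.
import time
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

ONNX_PATH = "path/to/onnx-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(ONNX_PATH)
text = "example prompt to scan"

def ms_per_call(model, inputs, n=200, warmup=20):
    for _ in range(warmup):  # warm-up so CUDA init / first-run costs don't skew the numbers
        model(**inputs)
    start = time.perf_counter()
    for _ in range(n):
        model(**inputs)
    return (time.perf_counter() - start) / n * 1000

# Old path from the diff: IO binding disabled, numpy inputs.
old_model = ORTModelForSequenceClassification.from_pretrained(
    ONNX_PATH, provider="CUDAExecutionProvider", use_io_binding=False
)
np_inputs = {
    k: v.numpy()
    for k, v in tokenizer(text, return_tensors="pt", truncation=True, max_length=512).items()
}
print(f"io_binding=False, numpy inputs: {ms_per_call(old_model, np_inputs):.1f} ms")

# New path from the diff: IO binding enabled, torch tensors straight from the tokenizer.
new_model = ORTModelForSequenceClassification.from_pretrained(
    ONNX_PATH, provider="CUDAExecutionProvider", use_io_binding=True
)
pt_inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(f"io_binding=True, torch inputs:  {ms_per_call(new_model, pt_inputs):.1f} ms")
```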
Any guidance on how to reason about this would be much appreciated! Thanks!