Speeding-up inference of T5-like model

Question

I am currently using a model called T0pp (https://huggingface.co/bigscience/T0pp) in production and would like to speed up inference.

I am running the following code on an on-demand EC2 g4dn.12xlarge instance (4 Nvidia T4 GPUs):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

model.parallelize()

input_dict = tokenizer(generation_input.inputs, return_tensors="pt", padding=True)
inputs = input_dict.input_ids.to("cuda:0")
attention_mask = input_dict.attention_mask.to("cuda:0")
with torch.no_grad():
    outputs = model.generate(inputs, attention_mask=attention_mask)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

I wanted to know which alternative you would try in order to speed-up inference, and if you knew good tutorials to do so. The main alternatives I see to speed-up inference would be to use the underlying Pytorch models with:

ONNX
Deepspeed
or using fp16 instead of fp32 parameters (with the main drawback of losing some quality)

Would someone have experience in using these tools, and would know which is the best / simplest option?

All this is quite new for me, and I must admit I've been a bit lost in ONNX and Deepspeed tutorials.

PS:

I already tried SageMaker, but this is not working for huge models like T0pp (40Gb).
Batching speeds up things, allowing to go from 1-2 seconds for batch size 1, to 16 seconds for batch size 32. In an ideal world, even batch size 32 would be under 1 or 2 seconds.

dragon7 · Accepted Answer

Maybe you could try OpenVINO? It allows you to convert your model into Intermediate Representation (IR) and then run on the CPU with the FP16 support. OpenVINO is optimized for Intel hardware but it should work with any processor. I cannot guarantee your model will be faster on CPU than Nvidia GPU but it's worth giving it a try. Some NLP models are fast enough (like this BERT).

You can find a full tutorial on how to convert the PyTorch model here (FastSeg) and here (BERT). Some snippets below.

Install OpenVINO

The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[pytorch,onnx]

Save your model to ONNX

OpenVINO cannot convert PyTorch model directly for now but it can do it with ONNX model. This sample code assumes the model is for computer vision.

dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)

Use Model Optimizer to convert ONNX model

The Model Optimizer is a command line tool that comes from OpenVINO Development Package so be sure you have installed it. It converts the ONNX model to OV format (aka IR), which is a default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). The accuracy drop, in most cases, is insignificant. Run in command line:

mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"

Run the inference on the CPU

The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]

It's worth mentioning that Runtime can process the ONNX model directly. In that case, just skip the conversion (Model Optimizer) step and give onnx path to the read_model function.

Disclaimer: I work on OpenVINO.

Speeding-up inference of T5-like model

Answers (1)

Related Questions