Reputation: 222
I want to perform offline batch inference with a model that is too large to fit onto one GPU, and I want to use tensor parallelism for this. Previously I have used vLLM for batch inference. However, I now have a custom model that does not fit into vLLM's offered architectures.
My whole stack is built on top of ray, so I would like to distribute tensor shards across ray workers and perform inference there. So far, it seems that with the plain map_batches API, each worker replicates the entire model, which will yield OOM. This is, for example, what is done in this tutorial:
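To make the problem concrete, here is a minimal sketch of the plain map_batches pattern I am referring to. The `Predictor` class, the toy `nn.Linear` standing in for my custom model, and the exact keyword arguments are placeholders (and kwargs can differ across Ray versions):

```python
import numpy as np
import ray
import torch
import torch.nn as nn

class Predictor:
    def __init__(self):
        # Stand-in for the custom model; the real one does not fit on a single
        # GPU, yet this pattern loads a full copy of it on every actor.
        self.model = nn.Linear(4096, 4096).cuda().eval()

    def __call__(self, batch: dict) -> dict:
        with torch.no_grad():
            x = torch.as_tensor(batch["x"], dtype=torch.float32, device="cuda")
            batch["y"] = self.model(x).cpu().numpy()
        return batch

ds = ray.data.from_items(
    [{"x": np.random.rand(4096).astype(np.float32)} for _ in range(256)]
)
ds = ds.map_batches(
    Predictor,
    batch_size=32,
    batch_format="numpy",
    num_gpus=1,      # each actor gets one GPU and one full model replica
    concurrency=2,   # two replicas -> data parallelism, not tensor parallelism
)
ds.take(1)
```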
Now, what is the best workflow to run batch inference for a custom model using tensor parallelism (or any other technique that avoids fitting the entire model on one GPU)?
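For concreteness, this is roughly the kind of sharding I have in mind with plain Ray actors: each actor holds only one shard of a layer's weights, and the caller combines the partial results. This is only a hand-wavy sketch of a single column-parallel linear layer (all names are made up, and a real setup would load the matching slice of the checkpoint instead of random weights), not a full model:

```python
import numpy as np
import ray
import torch
import torch.nn as nn

ray.init(ignore_reinit_error=True)

@ray.remote(num_gpus=1)
class ShardWorker:
    """Holds one column shard of a linear layer (placeholder sketch)."""

    def __init__(self, in_features: int, shard_out_features: int, seed: int):
        torch.manual_seed(seed)
        # Only this shard's parameters live on this worker's GPU.
        self.shard = nn.Linear(in_features, shard_out_features).cuda().eval()

    def forward(self, x: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            t = torch.as_tensor(x, dtype=torch.float32, device="cuda")
            return self.shard(t).cpu().numpy()

# Two workers, each owning half of the output columns (column parallelism).
workers = [ShardWorker.remote(4096, 2048, seed=i) for i in range(2)]

batch = np.random.rand(32, 4096).astype(np.float32)
# Broadcast the batch, run each shard in parallel, concatenate the halves.
outputs = ray.get([w.forward.remote(batch) for w in workers])
full_output = np.concatenate(outputs, axis=1)  # shape (32, 4096)
```

Doing this by hand for every layer of a large model (including the row-parallel layers and the reductions between them) seems tedious, so I am wondering what the recommended way is to combine such sharded workers with Ray Data style batch inference.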
Upvotes: 0
Views: 163