Reputation: 222
I want to perform offline batch inference with a model that is too large to fit onto one GPU, and I want to use tensor parallelism for this. Previously I have used vLLM for batch inference. However, I now have a custom model that does not fit into vLLM's offered architectures.
My whole stack is built on top of ray, so I would like to distribute tensor shards across ray workers and perform inference there. So far, it seems that with the plain map_batches API, each worker replicates the entire model, which will yield OOM. This is, for example, what is done in this tutorial:
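To make the problem concrete, here is a minimal sketch of the plain map_batches pattern I am referring to. The `Predictor` class, the toy `nn.Linear` standing in for my custom model, and the exact keyword arguments are placeholders (and kwargs can differ across Ray versions):

```python
import numpy as np
import ray
import torch
import torch.nn as nn

class Predictor:
    def __init__(self):
        # Stand-in for the custom model; the real one does not fit on a single
        # GPU, yet this pattern loads a full copy of it on every actor.
        self.model = nn.Linear(4096, 4096).cuda().eval()

    def __call__(self, batch: dict) -> dict:
        with torch.no_grad():
            x = torch.as_tensor(batch["x"], dtype=torch.float32, device="cuda")
            batch["y"] = self.model(x).cpu().numpy()
        return batch

ds = ray.data.from_items(
    [{"x": np.random.rand(4096).astype(np.float32)} for _ in range(256)]
)
ds = ds.map_batches(
    Predictor,
    batch_size=32,
    batch_format="numpy",
    num_gpus=1,      # each actor gets one GPU and one full model replica
    concurrency=2,   # two replicas -> data parallelism, not tensor parallelism
)
ds.take(1)
```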
Now, what is the best workflow to run batch inference for a custom model using tensor parallelism (or any other technique that avoids fitting the entire model on one GPU)?
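For concreteness, this is roughly the kind of sharding I have in mind with plain Ray actors: each actor holds only one shard of a layer's weights, and the caller combines the partial results. This is only a hand-wavy sketch of a single column-parallel linear layer (all names are made up, and a real setup would load the matching slice of the checkpoint instead of random weights), not a full model:

```python
import numpy as np
import ray
import torch
import torch.nn as nn

ray.init(ignore_reinit_error=True)

@ray.remote(num_gpus=1)
class ShardWorker:
    """Holds one column shard of a linear layer (placeholder sketch)."""

    def __init__(self, in_features: int, shard_out_features: int, seed: int):
        torch.manual_seed(seed)
        # Only this shard's parameters live on this worker's GPU.
        self.shard = nn.Linear(in_features, shard_out_features).cuda().eval()

    def forward(self, x: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            t = torch.as_tensor(x, dtype=torch.float32, device="cuda")
            return self.shard(t).cpu().numpy()

# Two workers, each owning half of the output columns (column parallelism).
workers = [ShardWorker.remote(4096, 2048, seed=i) for i in range(2)]

batch = np.random.rand(32, 4096).astype(np.float32)
# Broadcast the batch, run each shard in parallel, concatenate the halves.
outputs = ray.get([w.forward.remote(batch) for w in workers])
full_output = np.concatenate(outputs, axis=1)  # shape (32, 4096)
```

Doing this by hand for every layer of a large model (including the row-parallel layers and the reductions between them) seems tedious, so I am wondering what the recommended way is to combine such sharded workers with Ray Data style batch inference.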
Upvotes: 0
Views: 163