Kleag

Reputation: 650

How to share a single GPU for inference with several models

Suppose you have a server with powerful CPUs and a lot of RAM. On this server, you also have one GPU with a limited amount of VRAM. Is it possible to run inference on this server with several LLMs, each being used for a limited amount of time each day? The idea would be to transfer a model to VRAM on demand when it is needed, and possibly run inference on the CPU while the GPU is occupied by another model, moving the model to VRAM once the GPU becomes available. Is this scenario feasible? Are there already libraries that allow it?
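To make the scenario concrete, here is a minimal sketch of what I have in mind, using PyTorch and Hugging Face Transformers. The lock-based "scheduler" and the model names ("gpt2", "distilgpt2") are only placeholders for illustration, not an existing library:

```python
import threading
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical scheduler: one lock serializes GPU ownership between models.
gpu_lock = threading.Lock()

class ManagedModel:
    def __init__(self, name: str):
        # The model stays resident in system RAM by default.
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(name)  # loaded on CPU

    def generate(self, prompt: str, max_new_tokens: int = 64) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        # Try to take the GPU; if another model holds it, fall back to CPU.
        if torch.cuda.is_available() and gpu_lock.acquire(blocking=False):
            try:
                self.model.to("cuda")
                out = self.model.generate(
                    **inputs.to("cuda"), max_new_tokens=max_new_tokens
                )
            finally:
                self.model.to("cpu")        # free VRAM for the next model
                torch.cuda.empty_cache()
                gpu_lock.release()
        else:
            out = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

# Hypothetical usage: two models sharing one GPU on demand.
model_a = ManagedModel("gpt2")
model_b = ManagedModel("distilgpt2")
print(model_a.generate("Hello"))
```

This illustrates the on-demand transfer, but hand-rolling it would not handle batching, concurrent requests, or transfer latency well, which is why I am looking for an existing solution.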

I searched on the internet without finding a solution. I found only answers about running models on CPU or optimizing GPU usage.

Upvotes: 0

Views: 69

Answers (0)
