Reputation: 21
I am running a RAG pipeline with the Gemma LLM locally. It works fine, but it cannot handle more than one request at a time.
Is there a way to handle concurrent requests while still using resources efficiently? Can continuous batching help, and how would I implement it in the code?
Used in the code: LlamaIndex, LlamaCPP, Pinecone, Flask
Upvotes: 2
Views: 2462
Reputation: 6831
Try using vLLM instead of Llama.cpp (which is not thread-safe). vLLM can serve concurrent (overlapping) requests in parallel and keeps up with the most recent models, even ones released in the previous week (as Llama 3 and Mixtral 8x22B currently are).
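As a minimal sketch of how this could fit the stack in the question: launch vLLM's OpenAI-compatible server (its scheduler performs continuous batching across overlapping requests automatically) and call it from several threads, roughly as separate Flask handlers would. The model name google/gemma-7b-it, the port, and the thread-pool test harness below are assumptions, not something from the original answer.

```python
# Start the vLLM server first (continuous batching is built into its scheduler):
#   python -m vllm.entrypoints.openai.api_server --model google/gemma-7b-it --port 8000
#
# Then point any OpenAI-compatible client at it. Each Flask worker/thread can
# issue its own request; vLLM batches the overlapping requests on the GPU.

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# api_key can be any placeholder string when talking to a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="google/gemma-7b-it",  # must match the model vLLM was started with
        messages=[{"role": "user", "content": question}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Simulate several concurrent requests, e.g. from separate Flask request handlers.
questions = ["What is RAG?", "Explain continuous batching.", "Summarise Gemma."]
with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```

From there, the existing LlamaIndex RAG pipeline only needs its LLM object pointed at the same endpoint (for example via LlamaIndex's OpenAI-compatible LLM wrapper) instead of at a local LlamaCPP instance; the retrieval and Pinecone parts stay unchanged.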
Upvotes: 2