Reputation: 21
I am running a RAG pipeline with the Gemma LLM locally. It works fine, but it cannot handle more than one request at a time.
Is there a way to handle concurrent requests while still using resources efficiently? Can continuous batching help, and how would I implement it in the code?
Used in the code: LlamaIndex, LlamaCPP, Pinecone, Flask
Upvotes: 2
Views: 2462
Reputation: 6831
Try using vLLM instead of Llama.cpp (which is not thread-safe). vLLM can serve concurrent (overlapping) requests in parallel and keeps up with the most recent models, even ones released in the previous week (as Llama 3 and Mixtral 8x22B currently are).
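As a minimal sketch of how this could fit the stack in the question: launch vLLM's OpenAI-compatible server (its scheduler performs continuous batching across overlapping requests automatically) and call it from several threads, roughly as separate Flask handlers would. The model name google/gemma-7b-it, the port, and the thread-pool test harness below are assumptions, not something from the original answer.

```python
# Start the vLLM server first (continuous batching is built into its scheduler):
#   python -m vllm.entrypoints.openai.api_server --model google/gemma-7b-it --port 8000
#
# Then point any OpenAI-compatible client at it. Each Flask worker/thread can
# issue its own request; vLLM batches the overlapping requests on the GPU.

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# api_key can be any placeholder string when talking to a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="google/gemma-7b-it",  # must match the model vLLM was started with
        messages=[{"role": "user", "content": question}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Simulate several concurrent requests, e.g. from separate Flask request handlers.
questions = ["What is RAG?", "Explain continuous batching.", "Summarise Gemma."]
with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```

From there, the existing LlamaIndex RAG pipeline only needs its LLM object pointed at the same endpoint (for example via LlamaIndex's OpenAI-compatible LLM wrapper) instead of at a local LlamaCPP instance; the retrieval and Pinecone parts stay unchanged.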
Upvotes: 2