Reputation: 53
I am working on a RAG app where I use LLMs to analyze various documents. I'm looking to improve the UX by streaming responses to the user in real time.
A snippet of my code:

from vllm import LLM, SamplingParams

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)
llm = LLM(MODEL_NAME,
          tensor_parallel_size=4,
          dtype="half",
          gpu_memory_utilization=0.5,
          max_model_len=27_000)
message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
response = llm.generate(message, params)
In its current form, the generate method blocks until the entire response has been produced. I'd like to change this so that responses are streamed and displayed incrementally to the user, enhancing interactivity. I was using vllm==0.5.0.post1 when I first wrote this code.
Does anyone have experience with implementing streaming with vLLM? Any guidance or examples would be appreciated!
Upvotes: 0
Views: 2276
Reputation: 1
AsyncLLMEngine will help you: its generate method is an async generator that yields partial outputs as tokens are produced, instead of returning only once generation is complete.
You can also refer to vLLM's api_server.py, which implements streaming on top of it.
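For illustration, here is a minimal sketch of what that could look like, reusing the MODEL_NAME, TEMPERATURE, and SYSTEM_PROMPT constants from your snippet. The stream_answer helper and the request_id handling are my own illustrative choices, and the exact AsyncEngineArgs fields and generate signature can differ slightly between vLLM versions, so double-check against vllm==0.5.0.post1:

import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Build the engine once at startup; the args mirror the LLM(...) call from the question.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_NAME,
                    tensor_parallel_size=4,
                    dtype="half",
                    gpu_memory_utilization=0.5,
                    max_model_len=27_000))

async def stream_answer(question, document):
    params = SamplingParams(temperature=TEMPERATURE,
                            min_tokens=128,
                            max_tokens=1024)
    prompt = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
    previous = ""
    # engine.generate is an async generator; each yielded RequestOutput
    # contains the full text generated so far for this request.
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        text = output.outputs[0].text
        delta = text[len(previous):]      # only the newly generated part
        previous = text
        print(delta, end="", flush=True)  # or push delta to your UI layer

asyncio.run(stream_answer("What is the document about?", document))

In a web app you would yield each delta from your endpoint (for example via a streaming/SSE response) rather than printing it.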
Upvotes: 0