Cihan Yalçın

Reputation: 53

Stream output using vLLM

I am working on a RAG app where I use LLMs to analyze various documents. I'm looking to improve the UX by streaming responses in real time.
Here is a snippet of my code:

from vllm import LLM, SamplingParams

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)
llm = LLM(MODEL_NAME,
          tensor_parallel_size=4,
          dtype="half",
          gpu_memory_utilization=0.5,
          max_model_len=27_000)

message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"

# Blocks until the full completion has been generated
response = llm.generate(message, params)

In its current form, the generate method blocks until the entire response has been generated. I'd like to change this so that responses are streamed and displayed incrementally to the user, enhancing interactivity.

I was using vllm==0.5.0.post1 when I first wrote that code.

Does anyone have experience with implementing streaming for LLMs with vLLM? Any guidance or examples would be appreciated!

Upvotes: 0

Views: 2276

Answers (1)

Eddy Qian

Reputation: 1

AsyncLLMEngine will help you: its generate() method is an async generator, so you can consume partial outputs as they are produced instead of waiting for the full completion.

You can also refer to vLLM's api_server.py for a reference implementation.
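
A minimal sketch of the idea, assuming the constants from your question (MODEL_NAME, TEMPERATURE, SYSTEM_PROMPT, question, document) are already defined and that the engine arguments mirror your original LLM(...) call; in the vLLM 0.5.x line, AsyncLLMEngine.generate() yields RequestOutput objects that hold the cumulative text generated so far:

import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Same settings as the original LLM(...) call, but through the async engine.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_NAME,
                    tensor_parallel_size=4,
                    dtype="half",
                    gpu_memory_utilization=0.5,
                    max_model_len=27_000))

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)

async def stream(prompt: str) -> None:
    printed = ""
    # generate() is an async generator; each RequestOutput contains the
    # cumulative text so far, so print only the newly added suffix.
    async for request_output in engine.generate(prompt, params,
                                                request_id=str(uuid.uuid4())):
        text = request_output.outputs[0].text
        print(text[len(printed):], end="", flush=True)
        printed = text
    print()

message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
asyncio.run(stream(message))

Because each RequestOutput carries the full text produced up to that point, the loop only writes out the part that has not been displayed yet, which gives the incremental effect you are after.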

Upvotes: 0
