Reputation: 53
I am working on a RAG app where I use LLMs to analyze various documents. I'm looking to improve the UX by streaming responses to the user in real time.
A snippet of my code:

from vllm import LLM, SamplingParams

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)
llm = LLM(MODEL_NAME,
          tensor_parallel_size=4,
          dtype="half",
          gpu_memory_utilization=0.5,
          max_model_len=27_000)
message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
response = llm.generate(message, params)
In its current form, the generate method blocks until the entire response has been produced. I'd like to change this so that responses are streamed and displayed incrementally to the user, enhancing interactivity. I was using vllm==0.5.0.post1 when I first wrote this code.
Does anyone have experience with implementing streaming with vLLM? Any guidance or examples would be appreciated!
Upvotes: 0
Views: 2276
Reputation: 1
AsyncLLMEngine will help you: its generate method is an async generator that yields partial outputs as tokens are produced, instead of returning only once generation is complete.
You can also refer to vLLM's api_server.py, which implements streaming on top of it.
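For illustration, here is a minimal sketch of what that could look like, reusing the MODEL_NAME, TEMPERATURE, and SYSTEM_PROMPT constants from your snippet. The stream_answer helper and the request_id handling are my own illustrative choices, and the exact AsyncEngineArgs fields and generate signature can differ slightly between vLLM versions, so double-check against vllm==0.5.0.post1:

import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Build the engine once at startup; the args mirror the LLM(...) call from the question.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_NAME,
                    tensor_parallel_size=4,
                    dtype="half",
                    gpu_memory_utilization=0.5,
                    max_model_len=27_000))

async def stream_answer(question, document):
    params = SamplingParams(temperature=TEMPERATURE,
                            min_tokens=128,
                            max_tokens=1024)
    prompt = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
    previous = ""
    # engine.generate is an async generator; each yielded RequestOutput
    # contains the full text generated so far for this request.
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        text = output.outputs[0].text
        delta = text[len(previous):]      # only the newly generated part
        previous = text
        print(delta, end="", flush=True)  # or push delta to your UI layer

asyncio.run(stream_answer("What is the document about?", document))

In a web app you would yield each delta from your endpoint (for example via a streaming/SSE response) rather than printing it.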
Upvotes: 0