Tsvi Sabo

Reputation: 675

Llama3 70B with vLLM fails with "assert len(running_scheduled.prefill_seq_groups) == 0" when waiting for a prediction

Setup: vLLM 0.4.1, Meta-Llama-3-70B-Instruct, served with FastAPI on GCP's Vertex AI.

After several prediction calls, the server fails with an assertion error:

assert len(running_scheduled.prefill_seq_groups) == 0

The model is loaded as follows:

    from vllm import LLM
    ...
    self.model = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        tensor_parallel_size=8,
        enable_prefix_caching=False,
        max_model_len=4096,
        download_dir="/dev/shm/cache/huggingface",
    )

Upvotes: 0

Views: 745

Answers (1)

Tsvi Sabo

Reputation: 675

Apparently the example above uses the vLLM LLM class in an async manner, but that class is meant for offline, synchronous use; for concurrent request handling, users are expected to use AsyncLLMEngine instead.

Here's an example:

    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.usage.usage_lib import UsageContext

    engine_args = AsyncEngineArgs(
        model=model_config.hf_model_path,
        engine_use_ray=bool(model_config.tensor_parallel_size > 1),
        ...
    )

    self.model = AsyncLLMEngine.from_engine_args(
        engine_args,
        usage_context=UsageContext.API_SERVER,
    )

To generate a response, follow vLLM's own API server implementation.
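
For instance, here is a minimal sketch of a FastAPI endpoint built on such an engine. The route name /predict, the max_tokens value, and building the engine at module load are placeholders, not anything from the question; the sketch assumes vLLM 0.4.x, where AsyncLLMEngine.generate(prompt, sampling_params, request_id) is an async generator that yields partial RequestOutput objects until generation finishes:

    import uuid

    from fastapi import FastAPI
    from vllm import SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    app = FastAPI()

    # Build the engine once at startup; model name and parallelism
    # mirror the question and are placeholders for your own config.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="meta-llama/Meta-Llama-3-70B-Instruct",
            tensor_parallel_size=8,
            max_model_len=4096,
        )
    )

    @app.post("/predict")
    async def predict(prompt: str):
        sampling_params = SamplingParams(max_tokens=512)
        # generate() yields a RequestOutput per step with the text so
        # far; the last item yielded holds the completed generation.
        results_generator = engine.generate(
            prompt, sampling_params, request_id=str(uuid.uuid4())
        )
        final_output = None
        async for request_output in results_generator:
            final_output = request_output
        return {"text": final_output.outputs[0].text}

Because each request is just awaited on the engine's async generator, many requests can be in flight at once and vLLM batches them internally, which is exactly the concurrent usage the plain LLM class does not support.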

Upvotes: 0
