Reputation: 675
vLLM 0.4.1, Llama-3-70B-Instruct, FastAPI, served on GCP's Vertex AI
Calling for predictions several times causes an assertion error:
assert len(running_scheduled.prefill_seq_groups) == 0
Model is loaded as follows:
from vllm import LLM
...
self.model = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    enable_prefix_caching=False,
    max_model_len=4096,
    download_dir="/dev/shm/cache/huggingface",
)
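For context, a minimal sketch of the kind of FastAPI handler that calls this model; the Predictor wrapper, /predict route, and sampling values below are illustrative assumptions, not the actual serving code:
from fastapi import FastAPI
from vllm import LLM, SamplingParams

class Predictor:
    def __init__(self):
        # Same constructor arguments as in the snippet above.
        self.model = LLM(
            model="meta-llama/Meta-Llama-3-70B-Instruct",
            tensor_parallel_size=8,
            enable_prefix_caching=False,
            max_model_len=4096,
            download_dir="/dev/shm/cache/huggingface",
        )

app = FastAPI()
predictor = Predictor()

@app.post("/predict")
def predict(prompt: str):
    # FastAPI runs sync handlers in a thread pool, so several in-flight
    # requests can reach the same synchronous LLM.generate() call at once,
    # which is where the scheduler assertion above is raised.
    params = SamplingParams(max_tokens=256, temperature=0.7)
    outputs = predictor.model.generate([prompt], params)
    return {"generated_text": outputs[0].outputs[0].text}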
Upvotes: 0
Views: 745
Reputation: 675
Apparently the example above uses the vLLM model (the LLM class) in an async manner, while that class is intended for synchronous, offline inference; for concurrent use, users are expected to use AsyncLLMEngine instead. Here's an example:
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.usage.usage_lib import UsageContext
engine_args = AsyncEngineArgs(
    model=model_config.hf_model_path,
    engine_use_ray=bool(model_config.tensor_parallel_size > 1),
    ...
)
self.model = AsyncLLMEngine.from_engine_args(
    engine_args,
    usage_context=UsageContext.API_SERVER,
)
To generate a response, follow vLLM's own API server implementation.
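For example, a minimal sketch of an async predict method on the same class that holds self.model, assuming the vLLM 0.4.1 generate(prompt, sampling_params, request_id) signature; the method name and sampling values are illustrative:
import uuid

from vllm import SamplingParams

async def predict(self, prompt: str) -> str:
    # Each request gets its own request_id so the engine can interleave
    # many concurrent generations on one event loop.
    request_id = str(uuid.uuid4())
    sampling_params = SamplingParams(max_tokens=256, temperature=0.7)

    # AsyncLLMEngine.generate yields partial RequestOutputs as tokens
    # stream in; the last one holds the finished generation.
    final_output = None
    async for output in self.model.generate(prompt, sampling_params, request_id):
        final_output = output

    return final_output.outputs[0].text
Each FastAPI request can await this method directly; the AsyncLLMEngine batches the concurrent requests internally instead of tripping the scheduler assertion.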
Upvotes: 0