Tsvi Sabo

Reputation: 675

Ray error when trying to deploy Llama 3 70B with vLLM on Vertex AI

Using Vertex AI custom container online predictions, I'm trying to deploy:

meta-llama/Meta-Llama-3-70B-Instruct

with vLLM 0.4.1 on 8 NVIDIA_L4 GPUs and getting:

/tmp/ray is over 95% full, available space: 5031063552; capacity: 101203873792. Object creation will fail if spilling is required.

This is the last log line I see; after that the deployment fails with no apparent reason. It looks like Vertex restarts the container, but eventually the deployment fails (probably due to a timeout).

Running the custom container on a VM had no issues.

To create the model I'm using the Google aiplatform SDK:

from google.cloud import aiplatform

model_resource = aiplatform.Model.upload(
    serving_container_image_uri=serving_container_image_uri,
    serving_container_shared_memory_size_mb=16384,  # 16 GB of shared memory
    ...
)
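For context, the endpoint deploy call is along these lines (the machine type shown is just what I'd expect for 8 L4s, not verbatim from my code):

endpoint = model_resource.deploy(
    machine_type="g2-standard-96",   # G2 shape that carries 8 NVIDIA L4 GPUs (illustrative)
    accelerator_type="NVIDIA_L4",
    accelerator_count=8,
)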

And to load the model with vLLM (code run inside the container):

from vllm import LLM

self.model = LLM(
    model=model_config.model_hf_name,
    dtype="auto",
    tensor_parallel_size=model_config.tensor_parallel_size,
    enforce_eager=model_config.enforce_eager,
    disable_custom_all_reduce=model_config.disable_custom_all_reduce,
    # vLLM uses Ray workers for multi-GPU tensor parallelism
    worker_use_ray=bool(model_config.tensor_parallel_size > 1),
    enable_prefix_caching=False,
    max_model_len=model_config.max_seq_len,
)

Upvotes: 2

Views: 461

Answers (1)

Tsvi Sabo

Reputation: 675

Apparently, Vertex AI online prediction with a custom container has a storage limitation for the running container.

So you need to set shared memory large enough for inter-GPU vLLM communication plus the model weights, which is ~142 GB+.
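As a rough sanity check on that number (a back-of-the-envelope sketch, not part of the original fix; the parameter count is approximate):

    # Meta-Llama-3-70B-Instruct has ~70.6B parameters; fp16/bf16 weights take 2 bytes each
    params = 70.6e9
    bytes_per_param = 2
    print(f"approx. weight storage: {params * bytes_per_param / 1e9:.0f} GB")  # ~141 GB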

  1. Upload the model with enough storage/shared memory:

    model_resource = aiplatform.Model.upload(
        serving_container_image_uri=serving_container_image_uri,
        serving_container_shared_memory_size_mb=240000,  # ~240 GB of shared memory
        ...
    )
  2. Point vLLM and Ray (vLLM's cluster-management dependency) to /dev/shm to avoid the storage exception; /dev/shm is a tmpfs sized by serving_container_shared_memory_size_mb, so it isn't limited by the container's disk:

    import os
    import ray

    model_path = "/dev/shm/new_model_path"
    ray_tmp_dir = "/dev/shm/tmp/ray"
    os.makedirs(ray_tmp_dir, exist_ok=True)
    ray.init(_temp_dir=ray_tmp_dir, num_gpus=model_config.tensor_parallel_size)

And download the model under /dev/shm/ as well:

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.usage.usage_lib import UsageContext

    engine_args = AsyncEngineArgs(
        model=model_config.hf_model_path,
        download_dir="/dev/shm/cache/huggingface",  # keep the HF cache on the tmpfs too
    )

    self.model = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.API_SERVER
    )
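For reference, a minimal usage sketch of the resulting engine (not from the original answer; the prompt and sampling parameters are illustrative):

    from vllm import SamplingParams

    async def complete(engine, prompt: str) -> str:
        params = SamplingParams(temperature=0.7, max_tokens=256)
        final = None
        # AsyncLLMEngine.generate() is an async generator of incremental RequestOutput objects
        async for output in engine.generate(prompt, params, request_id="req-0"):
            final = output
        return final.outputs[0].text

    # e.g. result = await complete(self.model, "Hello") inside the server's async handler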

Upvotes: 2
