Melissa Stockman

Reputation: 349

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.

When hitting the model endpoint with a single instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)

I tried increasing the number of replicas, assuming that the number of instances handed to each replica would be 1000 / (number of replicas), but that doesn't seem to be the case.
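My current understanding (an assumption on my part, not something I've confirmed in the docs) is that the service first splits the input into mini-batches of `batch_size` instances and then distributes those mini-batches across replicas, so the number of prediction requests is driven by `batch_size`, not by the replica count. A quick back-of-envelope sketch:

```python
import math

def requests_per_job(num_instances: int, batch_size: int) -> int:
    """Number of prediction requests the service would issue, assuming the
    input is split into mini-batches of `batch_size` instances each."""
    return math.ceil(num_instances / batch_size)

# With the default batch_size of 64, 1000 instances become 16 requests of
# up to 64 instances each; batch_size=4 yields 250 smaller (faster) requests.
print(requests_per_job(1000, 64))  # 16
print(requests_per_job(1000, 4))   # 250
```

If that model is right, adding replicas only spreads the same 64-instance requests over more machines; it doesn't make any individual request smaller or faster.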

I then read that the default batch size is 64, so I tried decreasing it to 4 in the Python code that creates the batch job:

from google.cloud import aiplatform


def run_batch_prediction_job(vertex_config):

    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )

    model = aiplatform.Model(vertex_config.model_resource_name)

    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=vertex_config.replica_count,
        max_replica_count=vertex_config.replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )

    batch_prediction_job = model.batch_predict(**batch_params)

    batch_prediction_job.wait()

    return batch_prediction_job

I've also tried increasing the machine type to n1-highcpu-16, which helped somewhat, but I'm not sure I understand how batches are distributed to replicas.

Is there another way to decrease the number of instances sent to the model? Or is there a way to increase the timeout? Is there log output I can use to help figure this out? Thanks

Upvotes: 2

Views: 1261

Answers (1)

Ricco D

Reputation: 7287

Answering your follow-up question above.

  1. Is that timeout for a single-instance request or a batch request? Also, is it in seconds?

    This is a timeout for the batch job creation request, and it is in seconds.

    According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code we will end up here and eventually in gapic, where timeout is properly described:

    timeout (float): The amount of time in seconds to wait for the RPC
        to complete. Note that if ``retry`` is used, this timeout
        applies to each individual attempt and the overall time it
        takes for this method to complete may be longer. If
        unspecified, the default timeout in the client
        configuration is used. If ``None``, then the RPC method will
        not time out.
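The key point in that docstring is that, when a retry policy is in effect, ``timeout`` caps each individual attempt rather than the call as a whole, so total wall time can approach timeout × attempts. A minimal pure-Python sketch of that semantic (my own illustration, not Vertex AI code; the function names are hypothetical):

```python
import time

def call_with_retry(fn, timeout, max_attempts):
    """Retry `fn` up to `max_attempts` times. `timeout` (seconds) caps each
    individual attempt, so total wall time can reach timeout * max_attempts."""
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            # Each attempt gets its own fresh deadline.
            return fn(deadline=start + timeout)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the timeout to the caller

def flaky_rpc(deadline, _state={"calls": 0}):
    # Hypothetical RPC stand-in: times out twice, then succeeds.
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise TimeoutError
    return "ok"

print(call_with_retry(flaky_rpc, timeout=5.0, max_attempts=3))  # ok
```

With ``timeout=None`` the GAPIC layer instead lets the RPC run indefinitely, which is why the docstring calls that case out separately.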
    

My suggestion is to stick with whatever is working for your prediction model. If adding the timeout improves things, build on it together with your initial fix of using a machine with a higher spec. You can also try a machine with more memory, such as the n1-highmem-* family.

Upvotes: 0
