santeko

Reputation: 520

Vertex AI Pipeline quota aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod

I'm getting started with creating a tuned model. I've got my training data in a .jsonl file, uploaded to a bucket, everything checks out. I've run the tuning 3 times and every time it fails on step 7/8.

com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; Failed to create custom job.Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to create external task or refresh its state. Task:Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to handle the pipeline task. Task: Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962
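The error above is long, but only two pieces matter for a quota check: the metric that was exceeded and the region the job ran in. A minimal sketch for pulling both out of the message text follows. It assumes the message keeps the shape shown above; the function name `parse_quota_error` and the shortened `SAMPLE` string are illustrative only, not part of any Vertex AI SDK.

```python
import re

# Shortened excerpt of the error above (the repeated task details are cut).
SAMPLE = (
    "com.google.cloud.ai.platform.common.errors.AiPlatformException: "
    "code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: "
    "aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; "
    "Failed to create custom job.Project number: 643054741456, "
    "Execution name: projects/643054741456/locations/europe-west4/"
    "metadataStores/default/executions/6209974609820216962"
)


def parse_quota_error(message: str) -> dict:
    """Extract the exceeded quota metrics and the first region named in the message.

    Hypothetical helper: it relies on the text layout seen in this one error.
    """
    # Metrics sit between "exceed quota limits:" and ", cause=".
    metrics_match = re.search(r"exceed quota limits: (.+?), cause=", message)
    metrics = (
        [m.strip() for m in metrics_match.group(1).split(",")]
        if metrics_match
        else []
    )
    # The region appears in resource names as /locations/<region>/.
    region_match = re.search(r"/locations/([a-z0-9-]+)/", message)
    region = region_match.group(1) if region_match else None
    return {"metrics": metrics, "region": region}


if __name__ == "__main__":
    print(parse_quota_error(SAMPLE))
```

For this message the region comes out as europe-west4, so it may be worth checking the quota for that region as well as us-central1.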

I followed the steps here: Vertex AI pipeline quota, with no luck.

I searched the quotas, and for the quota listed in the error message it says I'm at 0%.

It also shows no quotas are over 90%.

The docs say that these pipelines only run on us-central1. When I inspect the quota for restricted_image_training_tpu_v3_pod, it says my quota is 0. I can request an increase to 1, but I would have thought the docs would mention that you can't get started without it.

Here's what the pipeline looks like: [image]

Upvotes: 2

Views: 871

Answers (1)

Soleign H.

Reputation: 122

To add to kiran matthew's answer:

Since the model uses 64 cores of TPU v3, you can submit a quota increase request in multiples of 64 (e.g. 64 for 1 job, 128 for 2 concurrent jobs) under the "Restricted image training TPU V3 pod cores per region" quota.
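As a quick sanity check on the arithmetic, here is a small sketch. The only input taken from this answer is the 64-cores-per-job figure; the function names are made up for illustration, and the assumption that the quota is counted in cores follows from the quota name above.

```python
# Assumption from the answer above: one tuning job uses 64 TPU v3 cores.
CORES_PER_TUNING_JOB = 64


def required_quota(concurrent_jobs: int, cores_per_job: int = CORES_PER_TUNING_JOB) -> int:
    """Quota value (in cores) needed to run the given number of jobs at once."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    return concurrent_jobs * cores_per_job


def max_concurrent_jobs(quota_cores: int, cores_per_job: int = CORES_PER_TUNING_JOB) -> int:
    """How many jobs fit into an existing quota value (whole jobs only)."""
    if quota_cores < 0:
        raise ValueError("quota_cores must not be negative")
    return quota_cores // cores_per_job


if __name__ == "__main__":
    for jobs in (1, 2):
        print(f"{jobs} concurrent job(s) -> request {required_quota(jobs)} cores")
    # A quota of 0 or 1 leaves no room for even a single 64-core job.
    print("jobs possible with quota 1:", max_concurrent_jobs(1))
```

Under these assumptions, raising the quota from 0 to 1 (as considered in the question) would still not be enough for a single job; the request would need to be at least 64.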

Upvotes: 0
