Reputation: 520
I'm getting started with creating a tuned model. I've got my training data in a .jsonl file, uploaded to a bucket, everything checks out. I've run the tuning 3 times and every time it fails on step 7/8.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; Failed to create custom job.Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to create external task or refresh its state. Task:Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to handle the pipeline task. Task: Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962
I followed the steps here: Vertex AI pipeline quota, with no luck.
I searched the quotas and for the quota listed in the error message, it says I'm at 0%.
It also shows no quotas are over 90%.
The docs say that these pipelines only run on us-central1. When I inspect the quota for restricted_image_training_tpu_v3_pod, it says my quota is 0. I can request an increase to 1, but I would have thought the docs would mention that you can't get started without that.
Here's what the pipeline looks like:
Upvotes: 2
Views: 871
Reputation: 122
To add to kiran matthew's answer:
Since the model uses 64 cores of TPU v3, you should submit the quota increase request in multiples of 64 (e.g. 64 for 1 job, 128 for 2 concurrent jobs) under the "Restricted image training TPU V3 pod cores per region" quota.
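As a quick sanity check on the arithmetic, here is a minimal sketch that computes the value to request. The 64-cores-per-job figure comes from the answer above. The function and constant names are made up for illustration and are not part of any Google Cloud SDK.

```python
# Cores consumed by one tuning job, per the answer above (64 TPU v3 cores).
CORES_PER_TUNING_JOB = 64


def required_tpu_v3_pod_cores(concurrent_jobs: int) -> int:
    """Return the quota value to request for N concurrent tuning jobs."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    return CORES_PER_TUNING_JOB * concurrent_jobs


def quota_is_sufficient(quota_limit: int, concurrent_jobs: int = 1) -> bool:
    """Check whether a given quota limit covers the requested jobs."""
    return quota_limit >= required_tpu_v3_pod_cores(concurrent_jobs)


if __name__ == "__main__":
    print(required_tpu_v3_pod_cores(1))   # one job
    print(required_tpu_v3_pod_cores(2))   # two concurrent jobs
    print(quota_is_sufficient(1))         # a limit of 1 is below the 64 needed
```

By this reckoning, raising the quota from 0 to 1 as described in the question would still fall short of a single job; the request would need to be for at least 64.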
Upvotes: 0