Jan Zajac

Reputation: 21

Vertex pipeline model training component stuck running forever because of metadata issue

I'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.

The error appears in a gray box in the pipeline UI when I click on the model training component, and reads as follows:

Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.

I've looked in the Logs Explorer and found that the error logs are audit logs with the following fields:

protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"

protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default"

This leads me to think that there's an issue with the Vertex metadata store or with the way my pipeline is using it. The audit logs are generated automatically though, so I'm not sure.
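
For anyone trying to reproduce the log search, the two fields above can be combined into one filter expression. This is a minimal sketch that only builds the filter string; `build_log_filter` and its `min_severity` parameter are hypothetical names introduced here for illustration, not part of any Google library.

```python
def build_log_filter(method_name, resource_name, min_severity=None):
    """Join audit-log field comparisons into one filter string.

    Hypothetical helper: it only formats text and does not call any API.
    """
    clauses = [
        f'protoPayload.methodName="{method_name}"',
        f'protoPayload.resourceName="{resource_name}"',
    ]
    if min_severity:
        # Optional extra clause, e.g. "ERROR"; an assumption, not from the post.
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


# Values copied from the audit log entries quoted above.
log_filter = build_log_filter(
    "google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph",
    "projects/724306335858/locations/europe-west4/metadataStores/default",
)
print(log_filter)
```

The resulting string should be usable as a query in the Logs Explorer or as the filter argument to `gcloud logging read`.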

I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that previously worked in another project, but with no luck.

screenshot of ui

Upvotes: 1

Views: 768

Answers (1)

Prajna Rai T

Reputation: 1820

The retryable error you were getting was caused by a temporary issue, which has now been resolved.

You should now be able to rerun the pipeline, and it is not expected to enter the infinite retry loop.
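
For context on why the error is labeled retryable: an `etag` mismatch reported with `ABORTED` is the usual sign of optimistic concurrency control, where a write is rejected because the record changed after the caller last read it. The sketch below illustrates that general pattern only. It is not Vertex AI's implementation, and `Store`, `EtagMismatch`, `update_with_retry`, and the `interfere` hook are all hypothetical names made up for this example.

```python
import itertools


class EtagMismatch(Exception):
    """Raised when the caller's etag differs from the stored one (hypothetical)."""


class Store:
    """Toy in-memory record guarded by an etag; not the Vertex AI API."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.etag = next(self._counter)
        self.state = "RUNNING"

    def read(self):
        return self.etag, self.state

    def write(self, etag, state):
        # Reject the write if the record changed since the caller read it.
        if etag != self.etag:
            raise EtagMismatch(
                f"specified etag {etag} does not match server etag {self.etag}"
            )
        self.state = state
        self.etag = next(self._counter)


def update_with_retry(store, new_state, max_attempts=3, interfere=None):
    """Read-modify-write with a bounded number of retries on etag conflict."""
    for attempt in range(1, max_attempts + 1):
        etag, _ = store.read()
        if interfere is not None:
            # Simulates another writer touching the record between read and write.
            interfere(store, attempt)
        try:
            store.write(etag, new_state)
            return attempt
        except EtagMismatch:
            continue  # Re-read to pick up a fresh etag and try again.
    raise RuntimeError("gave up after repeated etag conflicts")
```

In this toy model a single conflicting write is absorbed by one retry, but if something else modifies the record on every attempt, the retries never succeed. That is one general way a retry loop can appear to hang; whether it matches what happened inside the service here is not something the error message alone confirms.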

Upvotes: 1
