Zeite7Souls

Reputation: 43

GCP Vertex AI Pipeline fails during build with an endpoint error

I have deployed a custom Kubeflow pipeline using a mix of AutoML components and a custom Kubeflow Component.

When I deploy the pipeline, it fails and I get the following error:

textPayload: "The replica workerpool0-0 exited with a non-zero status of 1. Termination reason:
 Error. To find out more about why your job exited please check the logs:
 https://console.cloud.google.com/logs/viewer?project=205438435937&resource=ml_job%2Fjob_id%XXXXXXXXXXXXXXXX&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%XXXXXXXXXXXXXXXXXXXX%22"
insertId: "ibt166bgd"
resource: {
 type: "ml_job"
 labels: {
  job_id: "XXXXXXXXXXXXXXXXXX"
  task_name: "service"
  project_id: "XXXXXXX-XXXXXX"
 }
}
timestamp: "2021-06-10T12:18:53.807150835Z"
severity: "ERROR"
labels: {
 ml.googleapis.com/endpoint: ""
}
logName: "projects/XXXXXXX-XXXXXX/logs/ml.googleapis.com%XXXXXXXXXXXXXXXXXXXX"
receiveTimestamp: "2021-06-10T12:18:55.087983509Z"
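To see what that link actually filters on, I percent-decoded its query string with the standard library (using a dummy job ID in place of the redacted one, so this is a sketch of the URL shape, not my exact link):

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL in the same shape as the one from the error message,
# with a dummy job_id standing in for the redacted value.
url = ("https://console.cloud.google.com/logs/viewer?"
       "project=205438435937&advancedFilter=resource.type%3D%22ml_job%22"
       "%0Aresource.labels.job_id%3D%22123456789%22")

# parse_qs percent-decodes the values for us
query = parse_qs(urlsplit(url).query)
advanced_filter = query["advancedFilter"][0]
print(advanced_filter)
# resource.type="ml_job"
# resource.labels.job_id="123456789"
```

That decoded filter is what I paste into the Logs Explorer to pull up the worker's own log lines.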

This is my pipeline configuration:

import kfp
from kfp.v2 import compiler
from kfp.v2.google.client import AIPlatformClient
from google_cloud_pipeline_components import aiplatform as gcc_aip

# Kubeflow pipeline defined by a Python function
@kfp.dsl.pipeline(
    name="sales-prediction-iowa",
    pipeline_root=pipeline_root_path)
def pipeline(project_id: str):
    pre_process = preprocess(
        project_id=project_id,
    )

    create_dataset = gcc_aip.TabularDatasetCreateOp(
        project=project_id,
        display_name=display_name,
        # gcs_source="gs://vertex-ai-pipeline-bucket/iowa-2020_pre-processed.csv"
        gcs_source=pre_process.output,
    )


    training_job_run_op = gcc_aip.AutoMLTabularTrainingJobRunOp(
        project=project_id,
        display_name="training-iowa-sales",
        optimization_prediction_type="regression",
        dataset=create_dataset.outputs["dataset"],
        model_display_name="iowa-sales-model",
        target_column="sale_dollars",
        training_fraction_split=0.8,
        validation_fraction_split=0.1,
        test_fraction_split=0.1,
        budget_milli_node_hours=8000,
    )

    endpoint_op = gcc_aip.ModelDeployOp(
        project=project_id, model=training_job_run_op.outputs["model"]
    )


compiler.Compiler().compile(pipeline_func=pipeline,
        package_path='iowa-pipeline-job.json')

api_client = AIPlatformClient(project_id=project_id, region=region)

response = api_client.create_run_from_job_spec(
    'iowa-pipeline-job.json',
    pipeline_root=pipeline_root_path,
    service_account=service_account,
    parameter_values={
        'project_id': project_id,
        # 'region': region,
        # 'pipeline_root_path': pipeline_root_path,
        # 'service_account': service_account,
        # 'display_name': display_name
    }
)
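For context on how I'm wiring step outputs together: as I understand it, a KFP task's `outputs` is mapping-style (output name → artifact), so I use subscript lookup between steps. A minimal stand-in sketch of that access pattern (`FakeTask` is hypothetical, not a real kfp class):

```python
# Toy stand-in for a pipeline task; only the dict-style access pattern
# is the point, the class and artifact strings are made up.
class FakeTask:
    def __init__(self, outputs):
        self.outputs = outputs  # mapping: output name -> artifact handle

training_task = FakeTask({"model": "<model artifact>",
                          "dataset": "<dataset artifact>"})

# Subscript access by output name, as with create_dataset.outputs["dataset"]
model = training_task.outputs["model"]
print(model)  # <model artifact>
```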

I have a sneaking suspicion it might be linked to regions, but please let me know if there is something else here.

Thanks in advance!

Upvotes: 1

Views: 1089

Answers (0)
