Zeite7Souls

Reputation: 43

GCP Vertex AI Pipeline fails during build with an endpoint error

I have deployed a custom Kubeflow pipeline using a mix of AutoML components and a custom Kubeflow Component.

When I deploy the pipeline, it fails and I get the following error:

textPayload: "The replica workerpool0-0 exited with a non-zero status of 1. Termination reason:
 Error. To find out more about why your job exited please check the logs:
 https://console.cloud.google.com/logs/viewer?project=205438435937&resource=ml_job%2Fjob_id%XXXXXXXXXXXXXXXX&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%XXXXXXXXXXXXXXXXXXXX%22"
insertId: "ibt166bgd"
resource: {
 type: "ml_job"
 labels: {
  job_id: "XXXXXXXXXXXXXXXXXX"
  task_name: "service"
  project_id: "XXXXXXX-XXXXXX"
 }
}
timestamp: "2021-06-10T12:18:53.807150835Z"
severity: "ERROR"
labels: {
 ml.googleapis.com/endpoint: ""
}
logName: "projects/XXXXXXX-XXXXXX/logs/ml.googleapis.com%XXXXXXXXXXXXXXXXXXXX"
receiveTimestamp: "2021-06-10T12:18:55.087983509Z"
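To see what that link actually filters on, I percent-decoded its query string with the standard library (using a dummy job ID in place of the redacted one, so this is a sketch of the URL shape, not my exact link):

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL in the same shape as the one from the error message,
# with a dummy job_id standing in for the redacted value.
url = ("https://console.cloud.google.com/logs/viewer?"
       "project=205438435937&advancedFilter=resource.type%3D%22ml_job%22"
       "%0Aresource.labels.job_id%3D%22123456789%22")

# parse_qs percent-decodes the values for us
query = parse_qs(urlsplit(url).query)
advanced_filter = query["advancedFilter"][0]
print(advanced_filter)
# resource.type="ml_job"
# resource.labels.job_id="123456789"
```

That decoded filter is what I paste into the Logs Explorer to pull up the worker's own log lines.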

This is my pipeline configuration:

import kfp
from kfp.v2 import compiler
from kfp.v2.google.client import AIPlatformClient
from google_cloud_pipeline_components import aiplatform as gcc_aip

# Kubeflow pipeline defined by a Python function
@kfp.dsl.pipeline(
    name="sales-prediction-iowa",
    pipeline_root=pipeline_root_path)
def pipeline(project_id: str):
    pre_process = preprocess(
        project_id=project_id,
    )

    create_dataset = gcc_aip.TabularDatasetCreateOp(
        project=project_id,
        display_name=display_name,
        # gcs_source="gs://vertex-ai-pipeline-bucket/iowa-2020_pre-processed.csv"
        gcs_source=pre_process.output,
    )


    training_job_run_op = gcc_aip.AutoMLTabularTrainingJobRunOp(
        project=project_id,
        display_name="training-iowa-sales",
        optimization_prediction_type="regression",
        dataset=create_dataset.outputs["dataset"],
        model_display_name="iowa-sales-model",
        target_column="sale_dollars",
        training_fraction_split=0.8,
        validation_fraction_split=0.1,
        test_fraction_split=0.1,
        budget_milli_node_hours=8000,
    )

    endpoint_op = gcc_aip.ModelDeployOp(
        project=project_id, model=training_job_run_op.outputs["model"]
    )


compiler.Compiler().compile(pipeline_func=pipeline,
        package_path='iowa-pipeline-job.json')

api_client = AIPlatformClient(project_id=project_id, region=region)

response = api_client.create_run_from_job_spec(
    'iowa-pipeline-job.json',
    pipeline_root=pipeline_root_path,
    service_account=service_account,
    parameter_values={
        'project_id': project_id,
        # 'region': region,
        # 'pipeline_root_path': pipeline_root_path,
        # 'service_account': service_account,
        # 'display_name': display_name
    }
)
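For context on how I'm wiring step outputs together: as I understand it, a KFP task's `outputs` is mapping-style (output name → artifact), so I use subscript lookup between steps. A minimal stand-in sketch of that access pattern (`FakeTask` is hypothetical, not a real kfp class):

```python
# Toy stand-in for a pipeline task; only the dict-style access pattern
# is the point, the class and artifact strings are made up.
class FakeTask:
    def __init__(self, outputs):
        self.outputs = outputs  # mapping: output name -> artifact handle

training_task = FakeTask({"model": "<model artifact>",
                          "dataset": "<dataset artifact>"})

# Subscript access by output name, as with create_dataset.outputs["dataset"]
model = training_task.outputs["model"]
print(model)  # <model artifact>
```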

I have a sneaking suspicion it might be linked to regions, but please let me know if there is something else here.

Thanks in advance!

Upvotes: 1

Views: 1089

Answers (0)
