aronchick

Reputation: 7128

Using InputPath/Output Path on Vertex?

TL;DR: does Google Cloud Vertex AI support InputPath/OutputPath?

I have set up my environment on Google Cloud in the following way:

export PROJECT_ID="xxx"
export SERVICE_ACCOUNT_ID="xxx"
export USER_EMAIL="xxx"
export BUCKET_NAME="xxx"
export FILE_NAME="xxx"
export GOOGLE_APPLICATION_CREDENTIALS="xxx"


gcloud iam service-accounts create $SERVICE_ACCOUNT_ID \
--description="Service principal for running vertex and creating pipelines/metadata" \
--display-name="$SERVICE_ACCOUNT_ID" \
--project ${PROJECT_ID}

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
    --role=roles/storage.objectAdmin

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
    --role=roles/aiplatform.user

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
    --role=roles/ml.admin

gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --format='table(bindings.role)' \
    --filter="bindings.members:serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com"

gcloud iam service-accounts add-iam-policy-binding \
    $SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
    --member="user:$USER_EMAIL" \
    --role="roles/iam.serviceAccountUser" \
    --project ${PROJECT_ID}

gsutil mb -p $PROJECT_ID gs://$BUCKET_NAME

gsutil iam ch \
    serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com:objectCreator,objectViewer \
    gs://$BUCKET_NAME

# See https://cloud.google.com/docs/authentication/getting-started#auth-cloud-implicit-python
gcloud iam service-accounts keys create $FILE_NAME.json \
    --iam-account=$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com
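Before pointing GOOGLE_APPLICATION_CREDENTIALS at the downloaded key, a quick stdlib-only sanity check of the file can save a confusing auth failure later (this helper is not part of the setup above; the field names follow the standard service-account key format):

```python
import json


def check_service_account_key(path: str) -> str:
    """Return the key's client_email, or raise if the file doesn't look like a service-account key."""
    with open(path) as f:
        key = json.load(f)
    # A usable key must identify itself and carry a private key.
    for field in ("type", "client_email", "private_key"):
        if field not in key:
            raise ValueError(f"key file missing required field: {field}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']!r}")
    return key["client_email"]
```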

I then have the following three blocks of code:

import kfp.v2.components
from kfp.v2.dsl import InputPath
from kubernetes.client.models import V1EnvVar
from kubernetes import client, config
from typing import NamedTuple
from base64 import b64encode
import kfp.v2.dsl as dsl
import kubernetes
import json
import kfp
from google.cloud import aiplatform
import datetime
import pprint as pp
import requests

from step_1 import step_1_fn
from step_2 import step_2_fn

step_1_comp = kfp.v2.dsl.component(
    func=step_1_fn,
    base_image="library/python:3.10-slim-buster",
    packages_to_install=[
        "dill",
    ],
)
step_2_comp = kfp.v2.dsl.component(
    func=step_2_fn,
    base_image="library/python:3.10-slim-buster",
    packages_to_install=[
        "dill",
    ],
)


@dsl.pipeline(
    pipeline_root="gs://minimal_vertex_test_bucket",
    name="minimalcompile",
)
def root():
    step_1_exec = step_1_comp()
    step_2_exec = step_2_comp(input_context_path=step_1_exec.outputs["output_context_path"])

import kfp
from kfp.v2.dsl import component, Artifact, Input, InputPath, Output, OutputPath, Dataset, Model
from typing import NamedTuple


def step_1_fn(
    output_context_path: OutputPath(str),
):
    import dill

    a = 10

    # Write the session to the file path KFP provides for this OutputPath.
    dill.dump_session(output_context_path)

import kfp
from kfp.v2.dsl import component, Artifact, Input, InputPath, Output, OutputPath, Dataset, Model
from typing import NamedTuple


def step_2_fn(
    input_context_path: InputPath(str),
    output_context_path: OutputPath(str),
    metadata_url: str = "",
):
    import dill

    # Restore the session written by step 1 from the path KFP provides for this InputPath.
    dill.load_session(input_context_path)

    print(f"A = {a}")

    dill.dump_session(output_context_path)

I use the following command to compile the above:

from kfp.v2 import compiler

compiler.Compiler().compile(pipeline_func=root_module.root, package_path="root.json")

Which generates a correct root.json file.
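(For reference, running that root.json on Vertex can be sketched as below with the google.cloud.aiplatform client already imported above; the display name and region are illustrative assumptions, not values from my setup:)

```python
def submit_pipeline(project_id: str, region: str = "us-central1"):
    """Submit the compiled pipeline spec to Vertex AI Pipelines."""
    # Imported lazily so the sketch can be defined without GCP libraries installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)
    job = aiplatform.PipelineJob(
        display_name="minimalcompile",              # illustrative name
        template_path="root.json",                  # output of the compile step above
        pipeline_root="gs://minimal_vertex_test_bucket",
    )
    job.submit()  # or job.run() to block until completion
```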

However, when I upload this to Google Cloud Vertex, I get the following error at Step 2 -

2022-06-13 16:41:34 PDT  workerpool0-0
[KFP Executor 2022-06-13 23:41:34,795 INFO]: Looking for component `step_2_fn` in --component_module_path `/tmp/tmp.cgPjtzQEt1/ephemeral_component.py`
[KFP Executor 2022-06-13 23:41:34,795 INFO]: Loading KFP component "step_2_fn" from /tmp/tmp.cgPjtzQEt1/ephemeral_component.py (directory "/tmp/tmp.cgPjtzQEt1" and module name "ephemeral_component")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor_main.py", line 104, in <module>
    executor_main()
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor_main.py", line 100, in executor_main
    executor.execute()
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor.py", line 307, in execute
    func_kwargs[k] = self._get_input_artifact_path(k)
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor.py", line 116, in _get_input_artifact_path
    raise ValueError(
ValueError: Failed to get input artifact path for artifact name input_context_path

However, when I look at the executor input for step 1, output_context_path does get a value:

--executor_input; {"outputs":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-1-fn_3079644658226167808/executor_output.json","parameters":{"output_context_path":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-1-fn_3079644658226167808/output_context_path"}}}}; --function_to_execute; step_1_fn

For the input path of step 2, it says this:

--executor_input; {"inputs":{"parameterValues":{"input_context_path":"","metadata_url":""},"parameters":{"input_context_path":{"stringValue":""},"metadata_url":{"stringValue":""}}},"outputs":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-2-fn_-6143727378628608000/executor_output.json","parameters":{"output_context_path":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-2-fn_-6143727378628608000/output_context_path"}}}}; --function_to_execute; step_2_fn

This would imply that the "outputFile" from step 1 is not being read as the input value in step 2. (output_context_path has a value, input_context_path in step 2 is empty).

So, the question is, is InputPath/OutputPath supported? If not, what's the standard practice for writing a large file from one step and importing that large file into the second step?
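(For context, the contract I'm relying on is that InputPath/OutputPath just hand each step a filesystem path; a local sketch of the intended handoff, using pickle in place of dill so it stays self-contained:)

```python
import os
import pickle
import tempfile


def step_1(output_context_path: str):
    # KFP materializes OutputPath as a writable file path; the step simply writes to it.
    with open(output_context_path, "wb") as f:
        pickle.dump({"a": 10}, f)


def step_2(input_context_path: str) -> int:
    # KFP materializes InputPath as a readable path pointing at step 1's file.
    with open(input_context_path, "rb") as f:
        context = pickle.load(f)
    return context["a"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "context")
    step_1(path)
    print(step_2(path))  # prints 10
```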

Upvotes: 0

Views: 1256

Answers (1)

TheManWhoKnows

Reputation: 31

I can't say at the moment whether it is supported or not. However, based on what you posted:

  • The quickest and easiest way appears to be to call: output_context_path.outputs["outputFile"]

  • Also try assigning the class attribute, e.g. output_context_path.path = "<your_path>", and then reading it back.

You can also try Input[Dataset] and Output[Dataset] with similar assignments; I know this works for me.
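(A rough local sketch of that artifact pattern; `FakeDataset` below is a stand-in for `kfp.v2.dsl.Dataset`, whose `.path` attribute is what each step reads or writes — the real components would declare `Output[Dataset]` / `Input[Dataset]` parameters and let Vertex provision the path:)

```python
import os
import tempfile


class FakeDataset:
    """Stand-in for kfp.v2.dsl.Dataset: in a real run, KFP sets .path to a provisioned file location."""
    def __init__(self, path: str):
        self.path = path


def step_1(output_context: FakeDataset):
    # With Output[Dataset], the step writes to the artifact's .path.
    with open(output_context.path, "w") as f:
        f.write("a=10")


def step_2(input_context: FakeDataset) -> str:
    # With Input[Dataset], the step reads the upstream artifact's .path.
    with open(input_context.path) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    artifact = FakeDataset(os.path.join(tmp, "context"))
    step_1(artifact)
    print(step_2(artifact))  # prints a=10
```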

Upvotes: 1
