Reputation: 7128
TL;DR: does Google Cloud Vertex AI support InputPath/OutputPath?
I have set up my environment on Google Cloud in the following way:
export PROJECT_ID="xxx"
export SERVICE_ACCOUNT_ID="xxx"
export USER_EMAIL="xxx"
export BUCKET_NAME="xxx"
export FILE_NAME="xxx"
export GOOGLE_APPLICATION_CREDENTIALS="xxx"
gcloud iam service-accounts create $SERVICE_ACCOUNT_ID \
--description="Service principal for running vertex and creating pipelines/metadata" \
--display-name="$SERVICE_ACCOUNT_ID" \
--project ${PROJECT_ID}
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
--role=roles/storage.objectAdmin
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
--role=roles/aiplatform.user
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
--role=roles/ml.admin
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--format='table(bindings.role)' \
--filter="bindings.members:serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts add-iam-policy-binding \
$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
--member="user:$USER_EMAIL" \
--role="roles/iam.serviceAccountUser"
--project ${PROJECT_ID}
gsutil mb -p $PROJECT_ID gs://$BUCKET_NAME
gsutil iam ch \
serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com:roles/storage.objectCreator,objectViewer \
gs://$BUCKET_NAME
Following https://cloud.google.com/docs/authentication/getting-started#auth-cloud-implicit-python, I then create a key file for the service account:
gcloud iam service-accounts keys create $FILE_NAME.json --iam-account=$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com
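As a quick sanity check (not strictly part of the setup, and the "xxx" value is a placeholder as above), the key file can be verified from Python before running anything on Vertex:

from google.cloud import storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
# roles/storage.objectAdmin grants objects.list, so listing is a safe probe:
print(list(client.list_blobs("xxx", max_results=1)))  # "xxx" = $BUCKET_NAME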
I then have the following three blocks of code:
import kfp
import kfp.v2.dsl

from step_1 import step_1_fn
from step_2 import step_2_fn
step_1_comp = kfp.v2.dsl.component(
func=step_1_fn,
base_image="library/python:3.10-slim-buster",
packages_to_install=[
"dill",
],
)
step_2_comp = kfp.v2.dsl.component(
func=step_2_fn,
base_image="library/python:3.10-slim-buster",
packages_to_install=[
"dill",
],
)
@kfp.dsl.pipeline(
pipeline_root="gs://minimal_vertex_test_bucket",
name="minimalcompile",
)
def root():
step_1_exec = step_1_comp()
step_2_exec = step_2_comp(input_context_path=step_1_exec.outputs["output_context_path"])
from kfp.v2.dsl import OutputPath


def step_1_fn(
    output_context_path: OutputPath(str),
):
    import dill

    # Example state that step 2 should recover from the session.
    a = 10
    # Write the dill session to the local path that KFP provides.
    dill.dump_session(output_context_path)
from kfp.v2.dsl import InputPath, OutputPath


def step_2_fn(
    input_context_path: InputPath(str),
    output_context_path: OutputPath(str),
    metadata_url: str = "",
):
    import dill

    # Restore the session written by step 1; this is where `a` comes from.
    dill.load_session(input_context_path)
    print(f"A = {a}")
    dill.dump_session(output_context_path)
I use the following to compile the above (root_module is the module that defines root):

from kfp.v2 import compiler

compiler.Compiler().compile(pipeline_func=root_module.root, package_path="root.json")

which generates a correct root.json file.
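For completeness, the upload/run step is roughly the following, using the google-cloud-aiplatform SDK (the region and service-account email here are placeholders, consistent with the xxx values above):

from google.cloud import aiplatform

aiplatform.init(
    project="xxx",                                     # $PROJECT_ID
    location="us-central1",                            # placeholder region
    staging_bucket="gs://minimal_vertex_test_bucket",
)

job = aiplatform.PipelineJob(
    display_name="minimalcompile",
    template_path="root.json",
    pipeline_root="gs://minimal_vertex_test_bucket",
)

# Run under the service account created during setup.
job.run(service_account="xxx@xxx.iam.gserviceaccount.com")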
However, when I run this on Google Cloud Vertex, step 2 fails with the following error:
[KFP Executor 2022-06-13 23:41:34,795 INFO]: Looking for component `step_2_fn` in --component_module_path `/tmp/tmp.cgPjtzQEt1/ephemeral_component.py`
[KFP Executor 2022-06-13 23:41:34,795 INFO]: Loading KFP component "step_2_fn" from /tmp/tmp.cgPjtzQEt1/ephemeral_component.py (directory "/tmp/tmp.cgPjtzQEt1" and module name "ephemeral_component")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor_main.py", line 104, in <module>
    executor_main()
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor_main.py", line 100, in executor_main
    executor.execute()
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor.py", line 307, in execute
    func_kwargs[k] = self._get_input_artifact_path(k)
  File "/usr/local/lib/python3.10/site-packages/kfp/v2/components/executor.py", line 116, in _get_input_artifact_path
    raise ValueError(
ValueError: Failed to get input artifact path for artifact name input_context_path
However, when I look at the executor input for step 1, I can see that output_context_path does get a value:
--executor_input; {"outputs":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-1-fn_3079644658226167808/executor_output.json","parameters":{"output_context_path":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-1-fn_3079644658226167808/output_context_path"}}}}; --function_to_execute; step_1_fn
For step 2, by contrast, the executor input says this:
--executor_input; {"inputs":{"parameterValues":{"input_context_path":"","metadata_url":""},"parameters":{"input_context_path":{"stringValue":""},"metadata_url":{"stringValue":""}}},"outputs":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-2-fn_-6143727378628608000/executor_output.json","parameters":{"output_context_path":{"outputFile":"/gcs/minimal_vertex_test_bucket/669070936339/minimalcompile-20220613043819/step-2-fn_-6143727378628608000/output_context_path"}}}}; --function_to_execute; step_2_fn
This implies that the "outputFile" from step 1 is not being passed through as the input value for step 2: output_context_path gets a value in step 1, but input_context_path in step 2 is an empty string.
So the question is: are InputPath/OutputPath supported on Vertex? If not, what is the standard practice for writing a large file in one step and reading it back in the next?
Upvotes: 0
Views: 1256
Reputation: 31
I can't say at the moment whether it is supported or not, but based on what you posted:
The quickest and easiest fix appears to be calling:
output_context_path.outputs["outputFile"]
You can also try assigning the artifact's path attribute, e.g. output_context_path.path = "<your_path>", and then reading it back from there.
You can also try Input[Dataset] and Output[Dataset] with similar assignments; I know this works for me.
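For illustration, here is a minimal, untested sketch of the two steps rewritten with Dataset artifacts (the dill logic is carried over from your question; the parameter names and wiring line are just placeholders):

from kfp.v2.dsl import Dataset, Input, Output, component

@component(base_image="library/python:3.10-slim-buster", packages_to_install=["dill"])
def step_1_fn(output_context: Output[Dataset]):
    import dill

    a = 10
    # Write the session to the artifact's local path; Vertex syncs it to GCS.
    dill.dump_session(output_context.path)

@component(base_image="library/python:3.10-slim-buster", packages_to_install=["dill"])
def step_2_fn(input_context: Input[Dataset], output_context: Output[Dataset]):
    import dill

    # Restore the session produced by step 1.
    dill.load_session(input_context.path)
    print(f"A = {a}")
    dill.dump_session(output_context.path)

In the pipeline, the wiring keeps the same shape:

step_2_exec = step_2_fn(input_context=step_1_exec.outputs["output_context"])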
Upvotes: 1