Reputation: 585
Hello, I thought my problem was simple, but googling for the answer showed me otherwise: within different SageMaker Pipeline steps (e.g. ClarifyCheckStep) I want to get the pipeline execution id so I can save the output of the different steps in a structured way. Does anyone have an idea? It seems pipeline execution variables cannot be used as plain strings: https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#execution-variables
Upvotes: 1
Views: 1338
Reputation: 194
You are right that execution variables cannot be used as strings, but they can be used with Join (in the pipeline definition of each step), e.g. Join(on='/', values=[s3_prefix, 'predictions', ExecutionVariables.PIPELINE_EXECUTION_ID])
Here s3_prefix is something like 's3://a/b'. The code above builds a path whose last folder is named after the pipeline execution id, so the resulting path looks like 's3://a/b/predictions/<execution_id>'.
You can pass pipeline parameters in the Join as well.
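As a minimal sketch (untested; the bucket/prefix and output name are placeholders), such a Join can be used directly as the destination of a ProcessingOutput:
from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join

# Destination resolves to 's3://a/b/predictions/<execution_id>' at runtime.
output = ProcessingOutput(
    output_name="predictions",
    source="/opt/ml/processing/output",
    destination=Join(on="/", values=["s3://a/b", "predictions", ExecutionVariables.PIPELINE_EXECUTION_ID]),
)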
Within pipeline steps you can pass the pipeline execution id as an environment variable or as an argument, whatever suits you best. Pipeline parameters can be passed in the same manner (e.g. via the "env" argument of a processor). Using this you can organize your data while executing SageMaker pipelines.
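As a hedged sketch of the argument route (the processor, step name, and script name are placeholders assumed to exist elsewhere), job_arguments accepts pipeline variables:
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.steps import ProcessingStep

# 'processor' and 'preprocess.py' are assumed to be defined elsewhere.
step = ProcessingStep(
    name="Predict",
    processor=processor,
    code="preprocess.py",
    job_arguments=["--execution-id", ExecutionVariables.PIPELINE_EXECUTION_ID],
)
# Inside the script, read the id (e.g. with argparse) and use it to name your outputs.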
Upvotes: 1
Reputation: 1720
In order to save outputs following a common structure tied to the pipeline execution, the most robust method currently available is to use the code_location and output_path parameters of the various steps, by first creating a path that contains the pipeline_name and possibly other details, plus a timestamp that guarantees its uniqueness.
Then, when you get your pipeline definition (e.g., with a get_pipeline() function), you can pass the pipeline_name and other variables. An example is as follows:
import time

pipeline = your_pipeline_script.get_pipeline(
    region=region,
    role=role,
    pipeline_name=your_pipeline_name,
    pipeline_detail=some_details + "-" + time.strftime("%Y%m%d%H%M%S", time.gmtime()),
)
Your output destination may then become something like this:
outputs_destination = f"s3://{pipeline_session.default_bucket()}/pipeline/{pipeline_name}/{pipeline_detail}"
This way, your path is pre-generated before the pipeline is executed, and you can control it with whatever parameters you want.
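For example, a sketch (assuming role, pipeline_session, and a train.py script exist; the SKLearn estimator and version are just illustrative) of wiring that destination into a step's output_path and code_location:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    role=role,
    output_path=f"{outputs_destination}/training",  # model artifacts land here
    code_location=f"{outputs_destination}/code",    # packaged source code lands here
    sagemaker_session=pipeline_session,
)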
One idea might be to create subfolders named after particular parameters. The important thing is to follow a well-defined and easily recognizable structure.
Upvotes: 0