uarfr

Reputation: 53

Access Parameter's value directly in AWS Sagemaker Pipeline

Inside a function that returns a Pipeline, where a Parameter is defined, e.g. (taken from here)

def get_pipeline(...):

    foo = ParameterString(
        name="Foo", default_value="foo"
    )

    # pipeline's steps definition here
    step = ProcessingStep(
        name=...,
        job_arguments=["--foo", foo],
    )

    return Pipeline(
        name=pipeline_name,
        parameters=[...],
        steps=[...],
        sagemaker_session=sagemaker_session,
    )

I know I can access the default value of a parameter by simply calling foo.default_value, but how can I access its value when the default value is overridden at runtime, e.g. by using

pipeline.start(parameters=dict(Foo='bar'))

?

My assumption is that in that case I don't want to read the default value, since it has been overridden, but the Parameter API is very limited and does not provide anything except name and default_value.

Upvotes: 5

Views: 3222

Answers (1)

Giuseppe La Gualano

Reputation: 1720

As written in the documentation:

Pipeline parameters can only be evaluated at run time. If a pipeline parameter needs to be evaluated at compile time, then it will throw an exception.

A way to use parameters as ProcessingStep arguments

If your requirement is to use them in a pipeline step, in particular the ProcessingStep, you will have to use the processor's run method and pass the parameter through its arguments (which is different from job_arguments).

See this official example.

By passing the pipeline_session as the sagemaker_session, calling .run() does not launch the processing job; instead, it returns the arguments needed to run the job as a step in the pipeline.

step_process = ProcessingStep(
    name="MyProcessingStep",  # any step name
    step_args=your_processor.run(
        # ...
        arguments=["--foo", foo],
    ),
)
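For context, here is a minimal sketch of that setup; the processor type, framework version, role ARN, and instance settings are assumptions for illustration, not taken from the question:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical role ARN

# Binding the processor to a PipelineSession is what makes .run() return
# step arguments instead of immediately starting a processing job.
your_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=pipeline_session,
)

The foo parameter is then forwarded through arguments=["--foo", foo] as above and is only resolved when the pipeline execution actually runs.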

In addition, there are some limitations: Not all built-in Python operations can be applied to parameters.

An example taken from the link above:

# An example of what not to do
my_string = "s3://{}/training".format(ParameterString(name="MyBucket", default_value=""))

# Another example of what not to do
int_param = str(ParameterInteger(name="MyInt", default_value=1))

# Instead, if you want to convert the parameter to string type, do
int_param = ParameterInteger(name="MyInt", default_value=1)
int_param.to_string()

# A workaround is to use Join
my_string = Join(on="", values=[
    "s3://",
    ParameterString(name="MyBucket", default_value=""),
    "/training"]
)

A way to use parameters to manipulate the pipeline internally

Personally, I prefer to pass the value directly when getting the pipeline definition, before starting it:

def get_pipeline(my_param_hardcoded, ...):

    # here you can use my_param_hardcoded

    my_param = ParameterString(
        name="Foo", default_value="foo"
    )

    # pipeline's steps definition here

    return Pipeline(
        name=pipeline_name,
        parameters=[my_param, ...],
        steps=[...],
        sagemaker_session=sagemaker_session,
    )


pipeline = get_pipeline(my_param_hardcoded, ...)
pipeline.start(parameters=dict(Foo=my_param_hardcoded))

Obviously this is not a very elegant approach, but I do not think it is conceptually wrong: after all, it is a parameter that is used to manipulate the pipeline and cannot be pre-processed beforehand (e.g. in a configuration file).

An example of use is building a name based on both the pipeline_name (which is passed to get_pipeline()) and a pipeline parameter. For example, if we wanted to create a custom step name from the concatenation of the two strings, the concatenation cannot happen at runtime, so it must be done with this trick, as sketched below.
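A minimal sketch of that naming use-case, assuming hypothetical names and values:

from sagemaker.workflow.parameters import ParameterString

def get_pipeline(pipeline_name, my_param_hardcoded):
    # the parameter still exists so the value reaches the job at run time
    my_param = ParameterString(name="Foo", default_value=my_param_hardcoded)

    # works: both pieces are plain Python strings at definition time
    custom_step_name = f"{pipeline_name}-{my_param_hardcoded}"

    # would throw: a Parameter cannot be turned into a string while the
    # pipeline definition is being built, only at run time
    # custom_step_name = f"{pipeline_name}-{my_param}"

    # ... use custom_step_name when defining the step, and keep my_param in
    # the pipeline's parameters list as in the snippet above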

Upvotes: 4
