duff18

Reputation: 752

Sagemaker Pipelines | Pass list of strings as parameter

SageMaker Pipelines only provides Parameter classes for single values (a string, a float, etc.). How can I handle a parameter that is best represented by a list (e.g., the list of features to select for training from a file with many features)?

Upvotes: 4

Views: 1405

Answers (1)

Giuseppe La Gualano

Reputation: 1720

Background: As a general best practice, feature names (e.g., the column names of a pandas DataFrame) should not contain spaces.

Base case

To work around the problem, you can pass a single string as the parameter, in which the feature names are separated by spaces:

features = "feature_0 feature_1 feature_2"

and then use it as usual with ParameterString.

If the feature names must contain spaces, I recommend putting a dedicated separator between the names instead of a space and splitting the string into a list of features later.
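As a minimal sketch of the separator approach, assuming a comma as the separator (any character that cannot occur in a feature name would do) and with hypothetical helper names `encode_features` and `decode_features`:

```python
SEPARATOR = ","  # assumption: no feature name contains a comma


def encode_features(features):
    """Join feature names into one string suitable for a single string parameter."""
    for name in features:
        if SEPARATOR in name:
            raise ValueError(f"feature name {name!r} contains the separator")
    return SEPARATOR.join(features)


def decode_features(encoded):
    """Split the string back into a list, dropping empty entries and outer whitespace."""
    return [part.strip() for part in encoded.split(SEPARATOR) if part.strip()]


encoded = encode_features(["feature 0", "feature 1", "feature 2"])
print(encoded)                   # feature 0,feature 1,feature 2
print(decode_features(encoded))  # ['feature 0', 'feature 1', 'feature 2']
```

The encoded string would be the value given to the string parameter, and `decode_features` would run inside the script that receives it.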

Then, in the training script, pass the parameter to an ArgumentParser configured with nargs="*", so that the space-separated names are collected into a list of individual strings.

import argparse

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--features",
        nargs="*",
        type=str,
        default=[]
    )

    args, _ = parser.parse_known_args()
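To see what `nargs="*"` actually produces, it can help to call the parser with an explicit argument list. The sketch below (the `parse_features` helper is hypothetical) shows that argparse collects separate command-line tokens into a list but does not split a value that arrives as one token containing spaces, so it adds a `split()` pass that covers both cases.

```python
import argparse


def parse_features(argv):
    """Parse --features from argv and return a flat list of feature names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--features", nargs="*", type=str, default=[])
    args, _ = parser.parse_known_args(argv)
    # Split each collected token on whitespace so a single quoted
    # "a b c" token yields the same result as three separate tokens.
    return [name for token in args.features for name in token.split()]


# Three separate tokens, as a shell would pass an unquoted value.
print(parse_features(["--features", "feature_0", "feature_1", "feature_2"]))
# One token containing spaces, as a quoted value would arrive.
print(parse_features(["--features", "feature_0 feature_1 feature_2"]))
# Both calls print ['feature_0', 'feature_1', 'feature_2']
```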

Extra case

If the whole string ends up as a single element of a list when the argument is passed to a pipeline component (e.g., a preprocessor), you can recover the individual names with a small function that re-parses the input.

import itertools
from typing import List

def decode_list_of_strings_input(str_input: List[str]) -> List[str]:
    # Split every element on whitespace, then flatten the nested lists.
    nested = [s.split() for s in str_input]
    return list(itertools.chain.from_iterable(nested))

Here is an example of the use of this code:

features = ['a b c']
features = decode_list_of_strings_input(features)

print(features)  # ['a', 'b', 'c']

Upvotes: 1
