Reputation: 752
SageMaker Pipelines only has Parameter classes for single values (a string, a float, etc.). How can I deal with a parameter that is best represented by a list (e.g. the list of features to select for training from a file with many features)?
Upvotes: 4
Views: 1405
Reputation: 1720
Background: as a general best practice, feature names (e.g., the column names of a pandas DataFrame) should not contain spaces.
To work around your problem, you can pass a single string as the parameter, in which the feature names are separated by spaces:
features = "feature_0 feature_1 feature_2"
and then use it as usual with ParameterString.
If the feature names can contain spaces, I recommend using a dedicated separator between names instead of a space and splitting the whole string into a list of features later.
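A minimal sketch of the separator approach, assuming a comma as the separator (any character that never occurs in a feature name would do). The helper names `encode_features` and `decode_features` are hypothetical and not part of the SageMaker SDK:

```python
# Hypothetical helpers; the choice of separator is an assumption.
SEPARATOR = ","


def encode_features(features):
    """Join feature names into one string that fits in a string parameter."""
    for name in features:
        if SEPARATOR in name:
            raise ValueError(f"feature name {name!r} contains the separator")
    return SEPARATOR.join(features)


def decode_features(encoded):
    """Split the string back into a list, dropping empty entries."""
    return [name.strip() for name in encoded.split(SEPARATOR) if name.strip()]


encoded = encode_features(["sepal length", "sepal width", "species"])
features = decode_features(encoded)
print(encoded)
print(features)
```

The encoded string would be the value given to the string parameter, and the decoding step would run inside the script that receives it.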
Then, in the training script, pass the parameter to an ArgumentParser configured with nargs="*", so that the space-separated values given after --features are collected into a list of individual words:
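import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # nargs="*" collects zero or more values after --features into a list
    parser.add_argument(
        "--features",
        nargs="*",
        type=str,
        default=[],
    )
    # parse_known_args ignores any other arguments passed to the script
    args, _ = parser.parse_known_args()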
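To illustrate how `nargs="*"` treats its input, here is a small sketch that feeds the same parser two explicit argument lists instead of reading the real command line. Which of the two forms a script actually receives depends on how the value is passed to it, so treat the argument lists below as assumptions for illustration:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--features", nargs="*", type=str, default=[])
    return parser


parser = build_parser()

# Case 1: every feature name arrives as its own command-line token.
separate, _ = parser.parse_known_args(["--features", "feature_0", "feature_1"])

# Case 2: the whole string arrives as one token (e.g. it was quoted).
single, _ = parser.parse_known_args(["--features", "feature_0 feature_1"])

print(separate.features)  # ['feature_0', 'feature_1']
print(single.features)    # ['feature_0 feature_1']
```

In the second case argparse does not split on the space itself, so the result is a one-element list that still needs to be split.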
If the string instead reaches a pipeline component (e.g., a preprocessor) as a list that contains the whole string as a single element, such as ['a b c'], it can be converted with a small decoding function:
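import itertools
from typing import List

def decode_list_of_strings_input(str_input: List[str]) -> List[str]:
    # Split each element on whitespace, then flatten the resulting lists
    tokens = [s.split() for s in str_input]
    return list(itertools.chain.from_iterable(tokens))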
Here is an example of how to use it:
features = ['a b c']
features = decode_list_of_strings_input(features)
print(features)
# prints ['a', 'b', 'c']
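The parsing and decoding steps can also be combined, so that the script gets the same list whether the names arrive as separate tokens or as one space-separated string. This is only a sketch: `parse_features` is a hypothetical helper name, and it assumes feature names never contain whitespace:

```python
import argparse
import itertools


def parse_features(argv):
    """Return the feature names given after --features as a flat list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--features", nargs="*", type=str, default=[])
    args, _ = parser.parse_known_args(argv)
    # Split every received token on whitespace and flatten the result,
    # so "a b c" and "a", "b", "c" both yield ['a', 'b', 'c'].
    return list(itertools.chain.from_iterable(t.split() for t in args.features))


print(parse_features(["--features", "a b c"]))
print(parse_features(["--features", "a", "b", "c"]))
```

In a real script the call would be `parse_features(sys.argv[1:])`; passing the argument list explicitly just keeps the function easy to test.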
Upvotes: 1