Reputation: 725
I've been running training jobs with the SageMaker Python SDK on SageMaker notebook instances and locally using IAM credentials. They work fine, but I want to be able to start a training job via AWS Lambda + API Gateway.
Lambda does not bundle the SageMaker Python SDK (the high-level SDK), so I am forced to use the SageMaker client from boto3
in my Lambda handler, e.g.
sagemaker = boto3.client('sagemaker')
Supposedly this boto3 service-level SDK gives me 100% control, but I can't find the argument or config name to specify a source directory and an entry point. I am running a custom training job that requires some data generation (using a Keras generator) on the fly.
Here's an example of my SageMaker SDK call
tf_estimator = TensorFlow(base_job_name='tensorflow-nn-training',
                          role=sagemaker.get_execution_role(),
                          source_dir=training_src_path,
                          code_location=training_code_path,
                          output_path=training_output_path,
                          dependencies=['requirements.txt'],
                          entry_point='main.py',
                          script_mode=True,
                          instance_count=1,
                          instance_type='ml.g4dn.2xlarge',
                          framework_version='2.3',
                          py_version='py37',
                          hyperparameters={
                              'model-name': 'my-model-name',
                              'epochs': 1000,
                              'batch-size': 64,
                              'learning-rate': 0.01,
                              'training-split': 0.80,
                              'patience': 50,
                          })
The input path is injected by calling fit():
input_channels = {
    'train': training_input_path,
}
tf_estimator.fit(inputs=input_channels)
source_dir is an S3 URI pointing to my src.zip.gz, which contains the model and the script that perform the training.
entry_point is where the training begins; the TensorFlow container simply runs python main.py.
code_location is an S3 prefix where the training source code is uploaded if I run this training locally with a local model and script.
output_path is an S3 URI where the training job will upload model artifacts.
However, I went through the documentation for SageMaker.Client.create_training_job and couldn't find any field that lets me set a source directory and an entry point.
Here's an example,
sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    TrainingJobName='tf-training-job-from-lambda',
    HyperParameters={},  # same dictionary as above
    AlgorithmSpecification={
        'TrainingImage': '763104351884.dkr.ecr.us-west-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': True
    },
    RoleArn='My execution role goes here',
    InputDataConfig=[
        {
            'ChannelName': 'train',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': training_input_path,
                    'S3DataDistributionType': 'FullyReplicated'
                }
            },
            'CompressionType': 'None',
            'RecordWrapperType': 'None',
            'InputMode': 'File',
        }
    ],
    OutputDataConfig={
        'S3OutputPath': training_output_path,
    },
    ResourceConfig={
        'InstanceType': 'ml.g4dn.2xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 16
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 600  # 10 minutes for testing
    }
)
From the config above, the client accepts the training input and output locations, but which config field lets me specify the source code directory and the entry point?
Upvotes: 0
Views: 4329
Reputation: 66
You can pass the source directory and entry point script to HyperParameters like this:
response = sm_boto3.create_training_job(
    TrainingJobName=f"{your_job_name}",
    HyperParameters={
        # boto3 requires hyperparameter values to be strings
        'model-name': 'my-model-name',
        'epochs': '1000',
        'batch-size': '64',
        'learning-rate': '0.01',
        'training-split': '0.80',
        'patience': '50',
        "sagemaker_program": "script.py",  # this is where you specify your train script
        "sagemaker_submit_directory": "s3://" + bucket + "/" + project + "/" + source,  # your S3 URI, e.g. s3://sm/tensorflow/source/sourcedir.tar.gz
    },
    AlgorithmSpecification={
        "TrainingImage": training_image,
        ...
    },
Note: make sure the submitted archive is a xxx.tar.gz; otherwise SageMaker will throw errors.
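As a sketch of producing that archive (the helper name and local paths here are illustrative, not part of any SageMaker API), you can build the tar.gz with Python's tarfile module and then upload it to the S3 location you reference in sagemaker_submit_directory:

```python
import os
import tarfile

def package_source(source_dir, archive_path="sourcedir.tar.gz"):
    """Bundle a training source directory (main.py, requirements.txt, ...)
    into a gzipped tarball with the files at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in os.listdir(source_dir):
            # arcname=name keeps entries at the top level of the archive,
            # so the container finds main.py directly after extraction
            tar.add(os.path.join(source_dir, name), arcname=name)
    return archive_path

# After packaging, upload the archive with boto3, e.g.:
# boto3.client("s3").upload_file("sourcedir.tar.gz", bucket,
#                                "tensorflow/source/sourcedir.tar.gz")
```

The files should sit at the root of the archive (not nested under a directory), so that the script named in sagemaker_program is found after extraction.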
Upvotes: 1