mofury

Reputation: 725

How to specify the source directory and entry point for a SageMaker training job using the Boto3 SDK? The use case is starting training via a Lambda call

I've been running training jobs using the SageMaker Python SDK on SageMaker notebook instances and locally using IAM credentials. They work fine, but I want to be able to start a training job via AWS Lambda + API Gateway.

Lambda does not support the SageMaker Python SDK (the high-level SDK) out of the box, so I am forced to use the SageMaker client from boto3 in my Lambda handler, e.g.

sagemaker = boto3.client('sagemaker')
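
For context, here is a minimal sketch of the handler shape I have in mind (the event field and response body are illustrative assumptions; the full set of create_training_job arguments is shown further below):

import boto3

# Client created outside the handler so it is reused across warm invocations.
sagemaker = boto3.client('sagemaker')

def lambda_handler(event, context):
    # Illustrative: the job name comes from the API Gateway payload; the
    # remaining required arguments (AlgorithmSpecification, RoleArn,
    # InputDataConfig, OutputDataConfig, ResourceConfig, StoppingCondition)
    # are the ones shown in the full example below.
    job_name = event.get('job_name', 'tf-training-job-from-lambda')
    response = sagemaker.create_training_job(
        TrainingJobName=job_name,
        # ...
    )
    return {'statusCode': 200, 'body': response['TrainingJobArn']}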

Supposedly this boto3 service-level SDK gives me full control, but I can't find the argument or config field for specifying a source directory and an entry point. I am running a custom training job that requires some data generation on the fly (using a Keras generator).

Here's an example of my SageMaker SDK call

tf_estimator = TensorFlow(base_job_name='tensorflow-nn-training',
                          role=sagemaker.get_execution_role(),
                          source_dir=training_src_path,
                          code_location=training_code_path,
                          output_path=training_output_path,
                          dependencies=['requirements.txt'],
                          entry_point='main.py',
                          script_mode=True,
                          instance_count=1,
                          instance_type='ml.g4dn.2xlarge',
                          framework_version='2.3',
                          py_version='py37',
                          hyperparameters={
                              'model-name': 'my-model-name',
                              'epochs': 1000,
                              'batch-size': 64,
                              'learning-rate': 0.01,
                              'training-split': 0.80,
                              'patience': 50,
                          })

The input path is injected by calling fit()

input_channels = {
    'train': training_input_path,
}
tf_estimator.fit(inputs=input_channels)

However, when I went through the documentation for SageMaker.Client.create_training_job, I couldn't find any field that allows me to set a source directory and an entry point.

Here's an example:

sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    TrainingJobName='tf-training-job-from-lambda',
    HyperParameters={},  # same hyperparameter dictionary as above (boto3 requires string values)
    AlgorithmSpecification={
        'TrainingImage': '763104351884.dkr.ecr.us-west-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': True
    },
    RoleArn='My execution role goes here',
    InputDataConfig=[
        {
            'ChannelName': 'train',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': training_input_path,
                    'S3DataDistributionType': 'FullyReplicated'
                }
            },
            'CompressionType': 'None',
            'RecordWrapperType': 'None',
            'InputMode': 'File',
        }  
    ],
    OutputDataConfig={
        'S3OutputPath': training_output_path,
    },
    ResourceConfig={
        'InstanceType': 'ml.g4dn.2xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 16
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 600 # 10 minutes for testing
    }
)

The config above lets me specify the training input and output locations, but which field allows the user to specify the source code directory and the entry point?

Upvotes: 0

Views: 4329

Answers (1)

JemmaChu

Reputation: 66

You can pass the source_dir and the entry point via HyperParameters, like this:

    import boto3

    sm_boto3 = boto3.client('sagemaker')

    # bucket, project, source, and training_image are placeholders defined elsewhere.
    response = sm_boto3.create_training_job(
        TrainingJobName="your-job-name",
        HyperParameters={
            # boto3 requires hyperparameter values to be strings
            'model-name': 'my-model-name',
            'epochs': '1000',
            'batch-size': '64',
            'learning-rate': '0.01',
            'training-split': '0.80',
            'patience': '50',
            "sagemaker_program": "script.py",  # this is where you specify your training script (entry point)
            "sagemaker_submit_directory": "s3://" + bucket + "/" + project + "/" + source,  # your S3 URI, e.g. s3://sm/tensorflow/source/sourcedir.tar.gz
        },
        AlgorithmSpecification={
            "TrainingImage": training_image,
            # ...
        },
        # ... remaining arguments (RoleArn, InputDataConfig, OutputDataConfig, etc.)
    )

Note: make sure the submit directory is a xxx.tar.gz archive; otherwise SageMaker will throw errors.
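
For reference, here is a rough sketch (the bucket, key, and local path below are placeholders, not taken from the question) of packaging the source directory into sourcedir.tar.gz and uploading it to S3 with boto3:

    import tarfile
    import boto3

    # Placeholder names -- replace with your own bucket, prefix, and local source directory.
    bucket = 'my-sagemaker-bucket'
    key = 'tensorflow/source/sourcedir.tar.gz'
    source_dir = './src'  # contains script.py, requirements.txt, etc.

    # SageMaker expects the submit directory to be a gzipped tarball.
    with tarfile.open('sourcedir.tar.gz', 'w:gz') as tar:
        tar.add(source_dir, arcname='.')

    # Upload it and pass 's3://{bucket}/{key}' as sagemaker_submit_directory.
    boto3.client('s3').upload_file('sourcedir.tar.gz', bucket, key)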

Refer to https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb

Upvotes: 1
