Reputation: 725
I've been running training jobs with the SageMaker Python SDK on SageMaker notebook instances and locally using IAM credentials. They work fine, but I want to be able to start a training job via AWS Lambda + API Gateway.
Lambda does not bundle the SageMaker Python SDK (the high-level SDK), so I am forced to use the SageMaker client from boto3
in my Lambda handler, e.g.
sagemaker = boto3.client('sagemaker')
Supposedly this boto3 service-level SDK gives me 100% control, but I can't find the argument or config name to specify a source directory and an entry point. I am running a custom training job that requires some data generation (using a Keras generator) on the fly.
Here's an example of my SageMaker SDK call
tf_estimator = TensorFlow(base_job_name='tensorflow-nn-training',
                          role=sagemaker.get_execution_role(),
                          source_dir=training_src_path,
                          code_location=training_code_path,
                          output_path=training_output_path,
                          dependencies=['requirements.txt'],
                          entry_point='main.py',
                          script_mode=True,
                          instance_count=1,
                          instance_type='ml.g4dn.2xlarge',
                          framework_version='2.3',
                          py_version='py37',
                          hyperparameters={
                              'model-name': 'my-model-name',
                              'epochs': 1000,
                              'batch-size': 64,
                              'learning-rate': 0.01,
                              'training-split': 0.80,
                              'patience': 50,
                          })
The input path is injected by calling fit():
input_channels = {
    'train': training_input_path,
}
tf_estimator.fit(inputs=input_channels)
source_dir is an S3 URI pointing to my src.zip.gz, which contains the model and the script that perform the training.
entry_point is where the training begins; the TensorFlow container simply runs python main.py.
code_location is an S3 prefix where the training source code is uploaded if I run this training locally with a local model and script.
output_path is an S3 URI where the training job will upload model artifacts.
However, I went through the documentation for SageMaker.Client.create_training_job and couldn't find any field that lets me set a source directory and an entry point.
Here's an example,
sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    TrainingJobName='tf-training-job-from-lambda',
    HyperParameters={},  # same dictionary as above
    AlgorithmSpecification={
        'TrainingImage': '763104351884.dkr.ecr.us-west-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': True
    },
    RoleArn='My execution role goes here',
    InputDataConfig=[
        {
            'ChannelName': 'train',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': training_input_path,
                    'S3DataDistributionType': 'FullyReplicated'
                }
            },
            'CompressionType': 'None',
            'RecordWrapperType': 'None',
            'InputMode': 'File',
        }
    ],
    OutputDataConfig={
        'S3OutputPath': training_output_path,
    },
    ResourceConfig={
        'InstanceType': 'ml.g4dn.2xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 16
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 600  # 10 minutes for testing
    }
)
From the config above, the client accepts the training input and output locations, but which config field lets me specify the source code directory and the entry point?
Upvotes: 0
Views: 4329
Reputation: 66
You can pass the source directory and entry point script to HyperParameters like this:
response = sm_boto3.create_training_job(
    TrainingJobName=f"{your_job_name}",
    HyperParameters={
        # boto3 requires hyperparameter values to be strings
        'model-name': 'my-model-name',
        'epochs': '1000',
        'batch-size': '64',
        'learning-rate': '0.01',
        'training-split': '0.80',
        'patience': '50',
        "sagemaker_program": "script.py",  # this is where you specify your train script
        "sagemaker_submit_directory": "s3://" + bucket + "/" + project + "/" + source,  # your S3 URI, e.g. s3://sm/tensorflow/source/sourcedir.tar.gz
    },
    AlgorithmSpecification={
        "TrainingImage": training_image,
        ...
    },
Note: make sure the submitted archive is a xxx.tar.gz; otherwise SageMaker will throw errors.
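As a sketch of producing that archive (the helper name and local paths here are illustrative, not part of any SageMaker API), you can build the tar.gz with Python's tarfile module and then upload it to the S3 location you reference in sagemaker_submit_directory:

```python
import os
import tarfile

def package_source(source_dir, archive_path="sourcedir.tar.gz"):
    """Bundle a training source directory (main.py, requirements.txt, ...)
    into a gzipped tarball with the files at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in os.listdir(source_dir):
            # arcname=name keeps entries at the top level of the archive,
            # so the container finds main.py directly after extraction
            tar.add(os.path.join(source_dir, name), arcname=name)
    return archive_path

# After packaging, upload the archive with boto3, e.g.:
# boto3.client("s3").upload_file("sourcedir.tar.gz", bucket,
#                                "tensorflow/source/sourcedir.tar.gz")
```

The files should sit at the root of the archive (not nested under a directory), so that the script named in sagemaker_program is found after extraction.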
Upvotes: 1