Inderpartap Cheema

Reputation: 483

How to save Tensorflow model in S3 (as /output/model.tar.gz) when using Tensorflow Estimator in AWS Sagemaker

I have a Keras model getting trained using an entry_point script and I am using the following pieces of code to store the model artifacts (in the entry_point script).

parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
args, _ = parser.parse_known_args()
model_dir = args.model_dir
---

tf.keras.models.save_model(
      model,
      os.path.join(model_dir, 'model/1'),
      overwrite=True,
      include_optimizer=True
     )

Ideally, model_dir should be /opt/ml/model, and SageMaker should automatically upload the contents of this folder to S3 as s3://<default_bucket>/<training_name>/output/model.tar.gz

When I run estimator.fit({'training': training_input_path}), the training is successful, but the CloudWatch logs show the following:

2020-09-16 02:49:12,458 sagemaker_tensorflow_container.training WARNING  No model artifact is saved under the path /opt/ml/model. Your training job will not save any model files to S3.

Even then, SageMaker does store my model artifacts; the only difference is that instead of being stored at s3://<default_bucket>/<training_name>/output/model.tar.gz, they are stored unzipped at s3://<default_bucket>/<training_name>/model/model/1/saved_model.pb along with the variables and assets folders. Because of this, the estimator.deploy() call fails, as it cannot find the artifacts in the output/ directory.

Sagemaker Python SDK - 2.6.0

Estimator code:

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='autoencoder-model.py',
                          role=role,
                          instance_count=1,
                          instance_type='ml.m5.large',
                          framework_version="2.3.0",
                          py_version="py37",
                          debugger_hook_config=False,
                          hyperparameters={'epochs': 20},
                          source_dir='/home/ec2-user/SageMaker/model',
                          subnets=['subnet-1', 'subnet-2'],
                          security_group_ids=['sg-1', 'sg-1'])

What could I be doing wrong here?

Upvotes: 4

Views: 3984

Answers (1)

Sandeep Joshi

Reputation: 31

Changing:

parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])

to:

parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 

fixed it for me. argparse converts hyphens in option names to underscores, so you can still reference args.model_dir.
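As a side note, this hyphen-to-underscore conversion is standard argparse behavior and can be checked outside SageMaker. A minimal sketch (the default and the simulated command line below are illustrative, not what SageMaker actually passes):

```python
import argparse

# argparse turns hyphens in a long option name into underscores
# when deriving the attribute name on the parsed namespace.
parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', type=str, default='/opt/ml/model')

# Simulate the kind of command line the training container would pass.
args, _ = parser.parse_known_args(['--model-dir', '/opt/ml/model'])
print(args.model_dir)  # prints: /opt/ml/model
```

So the script's existing references to args.model_dir keep working unchanged after renaming the flag.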

The SageMaker container saves the trained model to 'model-dir', then creates a tar.gz archive of that directory and uploads it to the S3 location given by 'model_dir'.

'model-dir' is the location inside the container, /opt/ml/..

'model_dir' maps to the 'output_path' that we define in:

tf_estimator = TensorFlow(entry_point='autoencoder-model.py', role=role, output_path=output_path, .....)

Hope this helps to resolve the issue.

Upvotes: 1
