Reputation: 483
I have a Keras model getting trained using an entry_point script and I am using the following pieces of code to store the model artifacts (in the entry_point script).
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
args, _ = parser.parse_known_args()
model_dir = args.model_dir
---
tf.keras.models.save_model(
    model,
    os.path.join(model_dir, 'model/1'),
    overwrite=True,
    include_optimizer=True
)
Ideally, model_dir should be /opt/ml/model, and SageMaker should automatically upload the contents of this folder to S3 as s3://<default_bucket>/<training_name>/output/model.tar.gz.
When I run estimator.fit({'training': training_input_path})
, the training succeeds, but the CloudWatch logs show the following:
2020-09-16 02:49:12,458 sagemaker_tensorflow_container.training WARNING No model artifact is saved under the path /opt/ml/model. Your training job will not save any model files to S3.
Even so, SageMaker does store my model artifacts; the only difference is that instead of being stored as s3://<default_bucket>/<training_name>/output/model.tar.gz
, they are now stored unzipped as s3://<default_bucket>/<training_name>/model/model/1/saved_model.pb
along with the variables and assets folders. Because of this, the estimator.deploy()
call fails, as it is unable to find the artifacts in the output/ directory.
Sagemaker Python SDK - 2.6.0
Estimator code:
from sagemaker.tensorflow import TensorFlow
tf_estimator = TensorFlow(entry_point='autoencoder-model.py',
                          role=role,
                          instance_count=1,
                          instance_type='ml.m5.large',
                          framework_version="2.3.0",
                          py_version="py37",
                          debugger_hook_config=False,
                          hyperparameters={'epochs': 20},
                          source_dir='/home/ec2-user/SageMaker/model',
                          subnets=['subnet-1', 'subnet-2'],
                          security_group_ids=['sg-1', 'sg-1'])
What could I be doing wrong here?
Upvotes: 4
Views: 3984
Reputation: 31
Changing:
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
to:
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
worked for me. argparse converts hyphens to underscores in the destination name, so you can still reference args.model_dir.
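A minimal sketch of why this works, assuming (as the warning above suggests) that the SageMaker TensorFlow estimator passes its own --model_dir argument (an S3 URI) to the entry point. With the option spelled --model-dir, that injected flag no longer matches, so parse_known_args leaves it in the unknown-args list and the SM_MODEL_DIR default (/opt/ml/model) is used instead:

```python
import argparse
import os

# In a real training container SM_MODEL_DIR is set to /opt/ml/model;
# we set it here only so the sketch runs locally.
os.environ.setdefault('SM_MODEL_DIR', '/opt/ml/model')

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

# Simulate the CLI argument the estimator injects (the S3 URI is hypothetical).
cli = ['--model_dir', 's3://bucket/job/model']
args, unknown = parser.parse_known_args(cli)

# '--model_dir' does not match '--model-dir', so it ends up in `unknown`
# and the default from SM_MODEL_DIR is used.
print(args.model_dir)  # /opt/ml/model
print(unknown)         # ['--model_dir', 's3://bucket/job/model']
```

Note that argparse only normalizes hyphens to underscores when building the attribute name (args.model_dir); it does not treat --model_dir and --model-dir as the same option on the command line, which is exactly what makes this fix work.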
The SageMaker container saves the trained model under 'model-dir', then packs that directory into a model.tar.gz archive and uploads it to the S3 location given by 'model_dir'.
'model-dir' is the location inside the container, /opt/ml/..
'model_dir' is mapped to the 'output_path' which we define in:
tf_estimator = TensorFlow(entry_point='autoencoder-model.py', role=role, output_path=output_path, .....)
Hope this helps resolve the issue.
Upvotes: 1