Reputation: 45
OK, I've been dealing with this issue in SageMaker for almost a week and I'm ready to pull my hair out. I've got a custom training script paired with a data processing script in a BYO (bring-your-own) algorithm Docker deployment scenario. It's a PyTorch model built with Python 3.x; the BYO Dockerfile was originally written for Python 2, but I can't see how that relates to the problem I'm having, which is that after a successful training run SageMaker doesn't save the model to the target S3 bucket.
I've searched far and wide and can't seem to find an applicable answer anywhere. This is all done inside a Notebook instance. Note: I am using this as a contractor and don't have full permissions to the rest of AWS, including downloading the Docker image.
Dockerfile:
FROM ubuntu:18.04
MAINTAINER Amazon AI <[email protected]>
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    python3-pip \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
pip3 install future numpy torch scipy scikit-learn pandas flask gevent gunicorn && \
rm -rf /root/.cache
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
COPY decision_trees /opt/program
WORKDIR /opt/program
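(An aside on how this image gets used: SageMaker starts the container with the single argument train, so the ENV PATH line above is what lets the executable /opt/program/train be found. If you can run the image anywhere, a quick sanity check along these lines, using the image name from the build step below, confirms the entry point resolves:)

# Confirm the image resolves `train` to the copied script
docker run --rm name-this-algo which train   # should print /opt/program/train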
Docker Image Build:
%%sh
algorithm_name="name-this-algo"
cd container
chmod +x decision_trees/train
chmod +x decision_trees/serve
account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)
region=${region:-us-east-2}
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi
# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)
# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}
docker push ${fullname}
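One caveat for anyone running this script today: aws ecr get-login was removed in AWS CLI v2. If the login line above fails, the v2 equivalent (same shell variables as the script above) is:

# AWS CLI v2 replacement for the deprecated `aws ecr get-login` line
aws ecr get-login-password --region ${region} | \
    docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"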
Env setup and session start:
common_prefix = "pytorch-lstm"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"
import os
from sagemaker import get_execution_role
import sagemaker as sage
sess = sage.Session()
role = get_execution_role()
print(role)
Training directory, image, and estimator setup, then a fit call:
TRAINING_WORKDIR = "a/local/directory"
training_input = sess.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print ("Training Data Location " + training_input)
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/image-that-works:working'.format(account, region)
tree = sage.estimator.Estimator(image,
                                role,
                                train_instance_count=1,
                                train_instance_type='ml.p2.xlarge',
                                output_path="s3://sagemaker-directory-that-definitely/exists",
                                sagemaker_session=sess)
tree.fit(training_input)
The above script definitely runs: I have print statements in my training script and they print the expected results to the console. Training runs as it's supposed to, finishes up, and says that it's uploading model artifacts when it DEFINITELY DOES NOT.
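One quick check after fit() returns is the estimator's model_data attribute, which holds the S3 URI SageMaker claims to have uploaded. A sketch, assuming the tree estimator and notebook credentials from above:

# The S3 URI where SageMaker says it put model.tar.gz:
print(tree.model_data)

# Check whether the object actually exists (head_object raises a
# ClientError with a 404 if it was never uploaded):
import boto3
bucket, key = tree.model_data.replace("s3://", "").split("/", 1)
size = boto3.client('s3').head_object(Bucket=bucket, Key=key)['ContentLength']
print(size)  # a few hundred bytes would mean an essentially empty tarball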
Model Deployment:
model = tree.create_model()
predictor = tree.deploy(1, 'ml.m4.xlarge')
This throws an error that the model can't be found. A call to aws sagemaker describe-training-job shows that the training completed, but the time it took to upload the model was suspiciously short (about six seconds; see the status below), so something is failing silently without telling me. Hopefully it's not just uploading it to the aether.
{
"Status": "Uploading",
"StartTime": 1595982984.068,
"EndTime": 1595982989.994,
"StatusMessage": "Uploading generated training model"
},
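Listing the output prefix directly shows what, if anything, actually landed in S3; a six-second "Uploading" phase is consistent with an empty or missing model.tar.gz. The bucket and prefix below are the placeholders standing in for my real output_path:

# List everything under the training job's output prefix; a healthy run
# leaves a model.tar.gz whose size reflects the model weights.
aws s3 ls --recursive "s3://sagemaker-directory-that-definitely/exists/"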
Here's what I've tried so far:
What bothers me is that there's no error log I can see. If someone could point me to one, I'd be happy, but if there's some hidden SageMaker kung fu that I don't know about, I would be forever grateful.
EDIT
The training job runs and prints to both the Jupyter cell and CloudWatch as expected. I've since lost the cell output in the notebook, but below are the last few lines from CloudWatch. The first number is the epoch and the rest are various custom model metrics.
Upvotes: 2
Views: 4393
Reputation: 513
Can you verify from the training job logs that your training script is running? It doesn't look like your Docker image would respond to the command train, which is what SageMaker requires, and so I suspect that your model isn't actually getting trained/saved to /opt/ml/model.
AWS documentation about how SageMaker runs the Docker container: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html
Edit: summarizing from the comments below - the training script must also save the model to /opt/ml/model (the model isn't saved automatically).
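A minimal sketch of what such a train entry point could look like; the model, filename, and training loop here are placeholders, and the only SageMaker-specific parts are the /opt/ml paths from the documentation linked above:

#!/usr/bin/env python3
# Minimal sketch of a BYO 'train' entry point (illustrative names only).
# SageMaker runs `docker run <image> train`, mounts input channels under
# /opt/ml/input/data/<channel>, and tars up /opt/ml/model when the script exits.
import os
import torch
import torch.nn as nn

TRAIN_DIR = "/opt/ml/input/data/training"  # 'training' is the default channel name
MODEL_DIR = "/opt/ml/model"                # contents become model.tar.gz in S3

def main():
    model = nn.Linear(10, 1)  # stand-in for the real PyTorch model
    # ... load data from TRAIN_DIR and run the training loop here ...

    # The crucial step: persist the trained model under /opt/ml/model.
    os.makedirs(MODEL_DIR, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(MODEL_DIR, "model.pth"))

if __name__ == "__main__":
    main()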
Upvotes: 1
Reputation: 19
Have you tried saving the model to a local file and moving it to S3 yourself? I would save it locally (to the directory the script runs in) and upload it via boto3.
The SageMaker session object may not have its bucket attribute initialized, and doing the upload explicitly isn't much of an extra step.
import boto3

s3 = boto3.client('s3')
# Note: the object key (third argument) is required, not optional.
with open("FILE_NAME", "rb") as f:
    s3.upload_fileobj(f, "BUCKET_NAME", "OBJECT_KEY")
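If you'd rather not hard-code a bucket name, the SageMaker session can supply one explicitly. A sketch, assuming the sess object from the question (the key prefix is illustrative):

import boto3
import sagemaker as sage

sess = sage.Session()
bucket = sess.default_bucket()  # returns (creating if needed) sagemaker-<region>-<account-id>
s3 = boto3.client('s3')
with open("FILE_NAME", "rb") as f:
    s3.upload_fileobj(f, bucket, "models/FILE_NAME")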
Upvotes: 0