Reputation: 564
I am attempting to run the example code for Amazon SageMaker on a local GPU. I have copied the code from the Jupyter notebook into the following Python script:
import boto3
import subprocess
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role
import os

sagemaker_session = sagemaker.Session()

instance_type = 'local'
if subprocess.call('nvidia-smi') == 0:
    # Set type to GPU if one is present
    instance_type = 'local_gpu'

# role = get_execution_role()

gluon.data.vision.MNIST('./data/train', train=True)
gluon.data.vision.MNIST('./data/test', train=False)

# successfully connects and uploads data
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/mnist')

hyperparameters = {
    'batch_size': 100,
    'epochs': 20,
    'learning_rate': 0.1,
    'momentum': 0.9,
    'log_interval': 100
}

m = MXNet("mnist.py",
          role=role,
          train_instance_count=1,
          train_instance_type=instance_type,
          framework_version="1.1.0",
          hyperparameters=hyperparameters)

# fails in Docker container
m.fit(inputs)

predictor = m.deploy(initial_instance_count=1, instance_type=instance_type)
m.delete_endpoint()
where the referenced mnist.py file is exactly as specified on GitHub. The script fails on m.fit in the Docker container with the following error:
algo-1-1DUU4_1 | Downloading s3://<S3-BUCKET>/sagemaker-mxnet-2018-10-07-00-47-10-435/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-1DUU4_1 | 2018-10-07 00:47:29,219 ERROR - container_support.training - uncaught exception during training: Unable to locate credentials
algo-1-1DUU4_1 | Traceback (most recent call last):
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
algo-1-1DUU4_1 | fw.train()
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/mxnet_container/train.py", line 169, in train
algo-1-1DUU4_1 | mxnet_env.download_user_module()
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/container_support/environment.py", line 89, in download_user_module
algo-1-1DUU4_1 | cs.download_s3_resource(self.user_script_archive, tmp)
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/container_support/utils.py", line 37, in download_s3_resource
algo-1-1DUU4_1 | script_bucket.download_file(script_key_name, target)
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
algo-1-1DUU4_1 | ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 172, in download_file
algo-1-1DUU4_1 | extra_args=ExtraArgs, callback=Callback)
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 307, in download_file
algo-1-1DUU4_1 | future.result()
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 73, in result
algo-1-1DUU4_1 | return self._coordinator.result()
algo-1-1DUU4_1 | File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 233, in result
algo-1-1DUU4_1 | raise self._exception
algo-1-1DUU4_1 | NoCredentialsError: Unable to locate credentials
I am confused that I can authenticate to S3 outside of the container (to upload the training/test data) but cannot within the Docker container, so I am guessing the issue has to do with passing the AWS credentials to the Docker container. Here is the generated docker-compose file:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-1DUU4:
    command: train
    environment:
      - AWS_REGION=us-west-2
      - TRAINING_JOB_NAME=sagemaker-mxnet-2018-10-07-00-47-10-435
    image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-mxnet:1.1.0-gpu-py2
    networks:
      sagemaker-local:
        aliases:
          - algo-1-1DUU4
    stdin_open: true
    tty: true
    volumes:
      - /tmp/tmpSkaR3x/algo-1-1DUU4/input:/opt/ml/input
      - /tmp/tmpSkaR3x/algo-1-1DUU4/output:/opt/ml/output
      - /tmp/tmpSkaR3x/algo-1-1DUU4/output/data:/opt/ml/output/data
      - /tmp/tmpSkaR3x/model:/opt/ml/model
version: '2.1'
Should the AWS credentials be passed in as environment variables?
I upgraded my sagemaker install after reading Using boto3 in local mode?, but that had no effect. I checked the credentials that are being fetched in the SageMaker session (outside the container) and they appear to be blank, even though I have ~/.aws/config and ~/.aws/credentials files:
{'_token': None, '_time_fetcher': <function _local_now at 0x7f4dbbe75230>, '_access_key': None, '_frozen_credentials': None, '_refresh_using': <bound method AssumeRoleCredentialFetcher.fetch_credentials of <botocore.credentials.AssumeRoleCredentialFetcher object at 0x7f4d2de48bd0>>, '_secret_key': None, '_expiry_time': None, 'method': 'assume-role', '_refresh_lock': <thread.lock object at 0x7f4d9f2aafd0>}
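For reference, I got that dump with roughly the following sketch (I'm assuming here that the default boto3 session is what sagemaker uses; note that assume-role credentials are fetched lazily, which may be why the fields show as None):

import boto3

# Inspect the credentials boto3 resolves for the default session.
# For an assume-role profile the fields are populated lazily, so
# _access_key etc. may be None until the credentials are first used.
creds = boto3.Session().get_credentials()
print(vars(creds))

# Forcing a refresh (this can trigger an STS call) populates them:
frozen = creds.get_frozen_credentials()
print(bool(frozen.access_key), bool(frozen.secret_key), bool(frozen.token))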
I am new to AWS, so I do not know how to diagnose the issue regarding AWS credentials. My .aws/config file has the following information (with placeholder values):
[default]
output = json
region = us-west-2
role_arn = arn:aws:iam::123456789012:role/SageMakers
source_profile = sagemaker-test
[profile sagemaker-test]
output = json
region = us-west-2
where the sagemaker-test profile has the AmazonSageMakerFullAccess policy attached in the IAM Management Console.
The .aws/credentials file has the following information (again with placeholder values):
[default]
aws_access_key_id = 1234567890
aws_secret_access_key = zyxwvutsrqponmlkjihgfedcba
[sagemaker-test]
aws_access_key_id = 0987654321
aws_secret_access_key = abcdefghijklmopqrstuvwxyz
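To confirm that the default (assume-role) profile in these files resolves at all, a quick sketch like this should print the assumed role's identity, or raise an error if the role cannot be assumed:

import boto3

# Force the assume-role chain to run; this calls STS and will raise
# if the default profile's role cannot be assumed via sagemaker-test.
sts = boto3.Session(profile_name='default').client('sts')
print(sts.get_caller_identity()['Arn'])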
Lastly, these are the versions of the applicable libraries from a pip freeze:
awscli==1.16.19
boto==2.48.0
boto3==1.9.18
botocore==1.12.18
docker==3.5.0
docker-compose==1.22.0
mxnet-cu91==1.1.0.post0
sagemaker==1.11.1
Please let me know if I left out any relevant information and thanks for any help/feedback that you can provide.
UPDATE: Thanks for your help, everyone! While attempting some of your suggested fixes, I noticed that boto3 was out of date and updated it (to boto3-1.9.26 and botocore-1.12.26), which appeared to resolve the issue. I was not able to find any documentation on this being an issue with boto3==1.9.18. If someone could help me understand what the issue was with boto3, I would be happy to mark their answer as correct.
Upvotes: 3
Views: 2116
Reputation: 238
SageMaker local mode is designed to pick up whatever credentials are available in your boto3 session, and pass them into the docker container as environment variables.
However, the version of the sagemaker sdk that you are using (1.11.1 and earlier) will ignore the credentials if they include a token, because that usually indicates short-lived credentials that won't remain valid long enough for a training job to complete or endpoint to be useful.
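A quick way to check whether your session credentials include such a token (a sketch against the default boto3 session) is:

import boto3

# Temporary (STS/assume-role) credentials carry a session token;
# sagemaker<=1.11.1 local mode will not forward credentials that have one.
creds = boto3.Session().get_credentials().get_frozen_credentials()
print('temporary' if creds.token else 'permanent')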
If you are using temporary credentials, try replacing them with permanent ones, or running from an ec2 instance (or SageMaker notebook!) that has an appropriate instance role assigned.
Also, the sagemaker sdk's handling of credentials changed in v1.11.2 and later: temporary credentials will be passed to local mode containers, but with a warning message. So you could just upgrade to a newer version and try again (pip install -U sagemaker).
Finally, boto3's credential resolution can change between versions, so try using the latest version of boto3 as well.
Upvotes: 1
Reputation: 61
I just confirmed that this example works on my machine locally. Please make sure the role you are using has permission to use the buckets whose names start with sagemaker. SageMaker by default creates buckets prefixed with sagemaker.
Upvotes: 1
Reputation: 2217
I'll assume that the library you're using has boto3 at its core. boto3 advises that there are several methods of authentication available to you.
But it sounds like the docker sandbox does not have access to your ~/.aws/credentials file, so I'd consider the other options that may be available to you. As I'm unfamiliar with docker, I can't give you a guaranteed solution for your scenario.
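For example, one of those methods is passing credentials explicitly when constructing the session, which sidesteps the ~/.aws files entirely (a sketch with placeholder values; don't hard-code real keys in committed code):

import boto3

# Explicit credentials instead of the shared ~/.aws files
# (placeholder values; prefer environment variables in real code).
session = boto3.Session(
    aws_access_key_id='1234567890',
    aws_secret_access_key='zyxwvutsrqponmlkjihgfedcba',
    region_name='us-west-2',
)
print(session.client('sts').get_caller_identity()['Account'])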
Upvotes: 0
Reputation: 9234
It looks like you have the credentials configured on your host at ~/.aws/credentials, but are trying to access them from a docker container running on that host.
The simplest solution seems to be mounting your AWS credentials into the container at the expected location. You appear to be using the sagemaker-mxnet:1.1.0-gpu-py2 image, which runs as the root user. Based on this, if you update the volumes in your docker-compose file for algo-1-1DUU4 to include:
volumes:
  ...
  - ~/.aws/:/root/.aws/
this will mount your credentials into the container for the root user, so your python script should be able to access them.
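Once mounted, a quick sanity check from inside the container (a sketch; this assumes boto3 is available in the image, which it is for the SageMaker containers) would be:

import boto3

# With /root/.aws mounted, boto3 should resolve credentials from
# the shared credentials file rather than finding none at all.
creds = boto3.Session().get_credentials()
print(creds.method)  # e.g. 'shared-credentials-file' or 'assume-role'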
Upvotes: 0