DFenstermacher

Reputation: 564

AWS NoCredentials in training

I am attempting to run the example code for Amazon SageMaker on a local GPU. I have copied the code from the Jupyter notebook into the following Python script:

import boto3
import subprocess
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role
import os

sagemaker_session = sagemaker.Session()
instance_type = 'local'
if subprocess.call('nvidia-smi') == 0:
    # Set type to GPU if one is present
    instance_type = 'local_gpu'
# get_execution_role() only works inside SageMaker, so supply the IAM role ARN directly
role = 'arn:aws:iam::123456789012:role/SageMakers'

gluon.data.vision.MNIST('./data/train', train=True)
gluon.data.vision.MNIST('./data/test', train=False)

# successfully connects and uploads data
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/mnist')

hyperparameters = {
    'batch_size': 100,
    'epochs': 20,
    'learning_rate': 0.1,
    'momentum': 0.9,
    'log_interval': 100
}

m = MXNet("mnist.py",
          role=role,
          train_instance_count=1,
          train_instance_type=instance_type,
          framework_version="1.1.0",
          hyperparameters=hyperparameters)

# fails in Docker container
m.fit(inputs)
predictor = m.deploy(initial_instance_count=1, instance_type=instance_type)
m.delete_endpoint()

where the referenced mnist.py file is exactly as specified on GitHub. The script fails at m.fit() inside the Docker container with the following error:

algo-1-1DUU4_1  | Downloading s3://<S3-BUCKET>/sagemaker-mxnet-2018-10-07-00-47-10-435/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-1DUU4_1  | 2018-10-07 00:47:29,219 ERROR - container_support.training - uncaught exception during training: Unable to locate credentials
algo-1-1DUU4_1  | Traceback (most recent call last):
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
algo-1-1DUU4_1  |     fw.train()
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/mxnet_container/train.py", line 169, in train
algo-1-1DUU4_1  |     mxnet_env.download_user_module()
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/environment.py", line 89, in download_user_module
algo-1-1DUU4_1  |     cs.download_s3_resource(self.user_script_archive, tmp)
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/utils.py", line 37, in download_s3_resource
algo-1-1DUU4_1  |     script_bucket.download_file(script_key_name, target)
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
algo-1-1DUU4_1  |     ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 172, in download_file
algo-1-1DUU4_1  |     extra_args=ExtraArgs, callback=Callback)
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 307, in download_file
algo-1-1DUU4_1  |     future.result()
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 73, in result
algo-1-1DUU4_1  |     return self._coordinator.result()
algo-1-1DUU4_1  |   File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 233, in result
algo-1-1DUU4_1  |     raise self._exception
algo-1-1DUU4_1  | NoCredentialsError: Unable to locate credentials

I am confused that I can authenticate to S3 outside of the container (to upload the training/test data) but not within the Docker container, so I am guessing the issue has to do with passing the AWS credentials into the Docker container. Here is the generated docker-compose file:

networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-1DUU4:
    command: train
    environment:
    - AWS_REGION=us-west-2
    - TRAINING_JOB_NAME=sagemaker-mxnet-2018-10-07-00-47-10-435
    image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-mxnet:1.1.0-gpu-py2
    networks:
      sagemaker-local:
        aliases:
        - algo-1-1DUU4
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpSkaR3x/algo-1-1DUU4/input:/opt/ml/input
    - /tmp/tmpSkaR3x/algo-1-1DUU4/output:/opt/ml/output
    - /tmp/tmpSkaR3x/algo-1-1DUU4/output/data:/opt/ml/output/data
    - /tmp/tmpSkaR3x/model:/opt/ml/model
version: '2.1'

Should the AWS credentials be passed in as environment variables?

I upgraded my sagemaker install after reading Using boto3 in local mode?, but that had no effect. I checked the credentials being fetched in the SageMaker session (outside the container) and they appear to be blank, even though I have ~/.aws/config and ~/.aws/credentials files:

{'_token': None, '_time_fetcher': <function _local_now at 0x7f4dbbe75230>, '_access_key': None, '_frozen_credentials': None, '_refresh_using': <bound method AssumeRoleCredentialFetcher.fetch_credentials of <botocore.credentials.AssumeRoleCredentialFetcher object at 0x7f4d2de48bd0>>, '_secret_key': None, '_expiry_time': None, 'method': 'assume-role', '_refresh_lock': <thread.lock object at 0x7f4d9f2aafd0>}
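For reference, that dict came from inspecting the session's credentials, along the lines of the snippet below (my own check; with an assume-role profile, boto3 returns deferred credentials whose keys stay None until they are first refreshed):

# Inspect the credentials that the boto3 session resolved; for an
# assume-role profile the access/secret keys are populated lazily.
creds = sagemaker_session.boto_session.get_credentials()
print(vars(creds))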

I am new to AWS so I do not know how to diagnose the issue regarding AWS credentials. My .aws/config file has the following information (with placeholder values):

[default]
output = json
region = us-west-2
role_arn = arn:aws:iam::123456789012:role/SageMakers
source_profile = sagemaker-test

[profile sagemaker-test]
output = json
region = us-west-2

Where the sagemaker-test profile has the AmazonSageMakerFullAccess policy attached in the IAM Management Console.

The .aws/credentials file has the following information (represented by placeholder values):

[default]
aws_access_key_id = 1234567890
aws_secret_access_key = zyxwvutsrqponmlkjihgfedcba
[sagemaker-test]
aws_access_key_id = 0987654321
aws_secret_access_key = abcdefghijklmopqrstuvwxyz
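For what it's worth, a quick way to confirm what each profile actually resolves to would be something like this (my own sanity check; the default profile should resolve through the assume-role chain to the SageMakers role):

import boto3

# Print the ARN that each profile authenticates as.
for profile in ('default', 'sagemaker-test'):
    session = boto3.Session(profile_name=profile)
    print(profile, session.client('sts').get_caller_identity()['Arn'])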

Lastly, these are versions of the applicable libraries from a pip freeze:

awscli==1.16.19
boto==2.48.0
boto3==1.9.18
botocore==1.12.18
docker==3.5.0
docker-compose==1.22.0
mxnet-cu91==1.1.0.post0
sagemaker==1.11.1

Please let me know if I left out any relevant information and thanks for any help/feedback that you can provide.

UPDATE: Thanks for your help, everyone! While attempting some of your suggested fixes, I noticed that boto3 was out of date and updated it (to boto3-1.9.26 and botocore-1.12.26), which appears to have resolved the issue. I was not able to find any documentation of this being an issue with boto3==1.9.18. If someone could help me understand what the issue was with boto3, I would be happy to mark their answer as correct.

Upvotes: 3

Views: 2116

Answers (4)

jesterhazy

Reputation: 238

SageMaker local mode is designed to pick up whatever credentials are available in your boto3 session, and pass them into the docker container as environment variables.

However, the version of the sagemaker sdk that you are using (1.11.1 and earlier) will ignore the credentials if they include a token, because that usually indicates short-lived credentials that won't remain valid long enough for a training job to complete or endpoint to be useful.

If you are using temporary credentials, try replacing them with permanent ones, or running from an ec2 instance (or SageMaker notebook!) that has an appropriate instance role assigned.
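For example, a minimal sketch of pointing the SDK at a profile backed by long-lived keys (the profile name here is taken from your config, and sagemaker.Session accepts a boto_session argument):

import boto3
import sagemaker

# Build the SageMaker session from a boto3 session backed by permanent
# keys (no session token), so local mode can forward them to the container.
boto_session = boto3.Session(profile_name='sagemaker-test')
sagemaker_session = sagemaker.Session(boto_session=boto_session)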

Also, the sagemaker sdk's handling of credentials changed in v1.11.2 and later -- temporary credentials will be passed to local mode containers, but with a warning message. So you could just upgrade to a newer version and try again (pip install -U sagemaker).

Also, boto3's credential handling can change between releases, so try upgrading to the latest version as well.

Upvotes: 1

Rui

Reputation: 61

I just confirmed that this example works on my machine locally. Please make sure the role you are using has permission to use buckets whose names start with sagemaker; SageMaker creates buckets with that prefix by default.

Upvotes: 1

UtahJarhead

Reputation: 2217

I'll assume that the library you're using has boto3 at its core. The boto3 documentation lists several methods of authentication available to you:

  • Passing credentials as parameters in the boto3.client() method
  • Passing credentials as parameters when creating a Session object
  • Environment variables
  • Shared credential file (~/.aws/credentials)
  • AWS config file (~/.aws/config)
  • Assume Role provider
  • Boto2 config file (/etc/boto.cfg and ~/.boto)
  • Instance metadata service on an Amazon EC2 instance that has an IAM role configured.

But it sounds like the Docker sandbox does not have access to your ~/.aws/credentials file, so I'd consider the other options available to you, such as passing credentials explicitly (sketched below). As I'm unfamiliar with Docker, I can't give you a guaranteed solution for your scenario.
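A minimal sketch of that first option, using the placeholder values from your question (hypothetical, and only one of several ways to supply credentials):

import boto3

# Pass credentials explicitly instead of relying on files that the
# container cannot see. The values below are placeholders.
s3 = boto3.client(
    's3',
    aws_access_key_id='1234567890',
    aws_secret_access_key='zyxwvutsrqponmlkjihgfedcba',
    region_name='us-west-2',
)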

Upvotes: 0

Jamie Starke

Reputation: 9234

It looks like you have the credentials configured on your host at ~/.aws/credentials but are trying to access them on a docker container running on the host.

The simplest solution seems to be mounting your AWS credentials into the container at the expected location. You appear to be using the sagemaker-mxnet:1.1.0-gpu-py2 image, which runs as the root user. Based on this, if you update the volumes in your docker-compose file for algo-1-1DUU4 to include:

volumes:
  ...
  - ~/.aws/:/root/.aws/

this will mount your credentials into the root user's home directory in the container, so your Python script should be able to access them.
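(For a one-off docker run outside of compose, the equivalent would be the flag -v ~/.aws:/root/.aws.)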

Upvotes: 0
