Reputation: 188

AWS sagemaker-container: How to create or pass the resourceconfig.json to framework for training?

I am trying to create a custom model/image/container for Amazon Sagemaker. I had read all the basics tutorials, how to create an image with your requirements. Actually i have a properly set image which runs tensorflow , trains, deploy and serve the model locally.

The problems come when i am trying to run the container using sagemaker python SDK. more precisely, trying to use the framework module and Class to create my own custom estimator to run the custom image/container.

here i post the minimum code to explain my case:

File Structure:

.
├── Dockerfile
├── variables.env
├── requirements.txt
├── test_sagemaker.ipynb
├── src
|   ├── train
|   ├── serve
|   ├── predict.py
|   └── custom_code/my_model_functions
|
└── local_test
    ├── train_local.sh
    ├── serve_local.sh
    ├── predict.sh
    └── test_dir
        ├── model/model.pkl
        ├── output/output.txt
        └── input
            ├── data/data.pkl
            └── config
                ├── hyperparameters.json
                ├── inputdataconfig.json
                └── resourceconfig.json

dockerfile.

FROM ubuntu:16.04

MAINTAINER Amazon AI <[email protected]>

# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq && \
    apt-get -y install python3-dev python3-setuptools

# Install pip
RUN cd /tmp && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py

# Installing Requirements
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

# Set SageMaker training environment variables
ENV SM_ENV_VARIABLES env_variables

COPY local_test/test_dir /opt/ml

# Set up the program in the image
COPY src /opt/program
WORKDIR /opt/program

Train


from __future__ import absolute_import

import json, sys, logging, os, subprocess, time, traceback
from pprint import pprint

# Custom Code Functions
from custom_code.custom_estimator import CustomEstimator
from custom_code.custom_dataset import create_dataset

# Important Seagemaker Modules
import sagemaker_containers.beta.framework as framework
from sagemaker_containers import _env

logger = logging.getLogger(__name__)

def run_algorithm_mode():
    """Run training in algorithm mode, which does not require a user entry point. """

    train_config = os.environ.get('training_env_variables')
    model_path = os.environ.get("model_path")

    print("Downloading Dataset")
    train_dataset,  test_dataset = create_dataset(None)
    print("Creating Model")
    clf = CustomEstimator.create_model(train_config)
    print("Starting Training")
    clf = clf.train_model(train_dataset, test_dataset)
    print("Saving Model")
    module_name = 'classifier.pkl'
    CustomEstimator.save_model(clf, model_path)


def train(training_environment):
    """Run Custom Model training in either 'algorithm mode' or using a user supplied module in local SageMaker environment.
    The user supplied module and its dependencies are downloaded from S3.
    Training is invoked by calling a "train" function in the user supplied module.
    Args:
        training_environment: training environment object containing environment variables,
                               training arguments and hyperparameters
    """

    if training_environment.user_entry_point is not None:
        print("Entry Point Receive")
        framework.modules.run_module(training_environment.module_dir,
                                     training_environment.to_cmd_args(),
                                     training_environment.to_env_vars(),
                                     training_environment.module_name,
                                     capture_error=False)
        print_directories()
    else:
        logger.info("Running Custom Model Sagemaker in 'algorithm mode'")
        try:
            _env.write_env_vars(training_environment.to_env_vars())
        except Exception as error:
            print(error)
        run_algorithm_mode()

def main():
    train(framework.training_env())
    sys.exit(0)

if __name__ == '__main__':
    main()

test_sagemaker.ipynb

I created this custom sagemaker estimator using the Framework class of the sagemaker estimator.

import boto3
from sagemaker.estimator import Framework

class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode"

    create_model = TensorFlow.create_model

    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        image_name=None,
        **kwargs
    ):
        super(ScriptModeTensorFlow, self).__init__(
            entry_point, source_dir , hyperparameters, image_name=image_name, **kwargs
        )
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '2.0.0'
        self.user_entry_point = entry_point
        print(self.user_entry_point)

Then create the estimator passing the entry_point and the images (all the others parameters the class needs to run.)

estimator = ScriptModeTensorFlow(entry_point='training_script_path/train_model.py',
                       image_name='sagemaker-custom-image:latest',
                       source_dir='source_dir_path/input/config',
                       train_instance_type='local',      # Run in local mode
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       py_version='py3',
                       role=role)

Finally, hitting training...

estimator.fit({"train": "s3://s3-bucket-path/training_data"})

but I get the following error:

Creating tmpm3ft7ijm_algo-1-mjqkd_1 ... 
Attaching to tmpm3ft7ijm_algo-1-mjqkd_12mdone
algo-1-mjqkd_1  | Reporting training FAILURE
algo-1-mjqkd_1  | framework error: 
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 65, in train
algo-1-mjqkd_1  |     env = sagemaker_containers.training_env()
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 27, in training_env
algo-1-mjqkd_1  |     resource_config=_env.read_resource_config(),
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 240, in read_resource_config
algo-1-mjqkd_1  |     return _read_json(resource_config_file_dir)
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 192, in _read_json
algo-1-mjqkd_1  |     with open(path, "r") as f:
algo-1-mjqkd_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | 
algo-1-mjqkd_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/bin/dockerd-entrypoint.py", line 24, in <module>
algo-1-mjqkd_1  |     subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
algo-1-mjqkd_1  |   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
algo-1-mjqkd_1  |     raise CalledProcessError(retcode, cmd)
algo-1-mjqkd_1  | subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 2.
tmpm3ft7ijm_algo-1-mjqkd_1 exited with code 1
Aborting on container exit...

At first glance the error seems obvious, the file '/opt/ml/input/config/resourceconfig.json' is missing. The thing is I have no way of creating this file so that sagemaker framework can get the host for multiprocessing (whcih i don t need them yeet). When I am creating the image 'sagemaker-custom-image:latest' following the folder structure show bellow, I already give the 'resoruceconfig.json' to the '/opt/ml/input/config/' folder inside the image.

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   ├── inputdataconfig.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure

Reading the documentation in AWS, when using sagemaker sdk to run your image, it says that all the files in the container in the folder 'opt/ml' may no longer be visibles during training.

/opt/ml and all sub-directories are reserved by Amazon SageMaker training. When building your algorithm’s docker image, please ensure you don't place any data required by your algorithm under them as the data may no longer be visible during training.How Amazon SageMaker Runs Your Training Image

This basically resumes my problem.

Yes, I know I can make use of the prebuilt estimators and images from sagemaker.

Yes, I know I can bypass the framwork library and run the image train from docker run.

But i have the need to implement a fully custom sagemaker sdk/image/container/model to use with entrypoint. I know is a bit ambitious.

So to Reformulate my question: How do I get Sagemaker Framework or SDK to create inside the image the require resourceconfig.json file?

Upvotes: 3

AWS sagemaker-container: How to create or pass the resourceconfig.json to framework for training?

Answers (2)

Related Questions