Ashutosh Kumar

Reputation: 109

Sagemaker Script Mode Training: How to import custom modules in training script?

I am trying to use SageMaker script mode to train a model on image data. I have multiple scripts for data preparation, model creation, and training. These are the contents of my working directory:

WORKDIR
|-- config
|   |-- hyperparameters.json
|   |-- lossweights.json
|   `-- lr.json
|-- dataset.py
|-- densenet.py
|-- resnet.py
|-- models.py
|-- train.py
|-- imagenet_utils.py
|-- keras_utils.py
|-- utils.py
`-- train.ipynb

The training script is train.py, and it imports from the other scripts. To launch the training job, I'm using the following code:

bucket = 'ashutosh-sagemaker'
data_key = 'training'
data_location = 's3://{}/{}'.format(bucket, data_key)
print(data_location)

inputs = {'data': data_location}
print(inputs)

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',
                       role=role,
                       train_instance_count=1,
                       train_instance_type='ml.p2.xlarge',
                       framework_version='1.14',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters={
                           'epochs': 10
                       }        
                      )

estimator.fit(inputs)

On running this code, I get the following output:

2020-11-09 10:42:07 Starting - Starting the training job...
2020-11-09 10:42:10 Starting - Launching requested ML instances......
2020-11-09 10:43:24 Starting - Preparing the instances for training.........
2020-11-09 10:44:43 Downloading - Downloading input data....................................
2020-11-09 10:51:08 Training - Downloading the training image...
2020-11-09 10:51:40 Uploading - Uploading generated training model

Traceback (most recent call last):
  File "train.py", line 5, in <module>
    from dataset import WatchDataSet
ModuleNotFoundError: No module named 'dataset'
WARNING: Logging before flag parsing goes to stderr.
E1109 10:51:37.525632 140519531874048 _trainer.py:94] ExecuteUserScriptError:
Command "/usr/local/bin/python3.6 train.py --epochs 10 --model_dir s3://sagemaker-ap-northeast-1-485707876195/tensorflow-training-2020-11-09-10-42-06-234/model"

2020-11-09 10:51:47 Failed - Training job failed

What should I do to resolve the ModuleNotFoundError? I tried to look for solutions but didn't find any relevant resources.

The contents of the train.py file:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from dataset import WatchDataSet
from models import BCNN
from utils import image_generator, val_image_generator
from utils import BCNNScheduler, LossWeightsModifier
from utils import restore_checkpoint, get_epoch_key

import argparse
from collections import defaultdict
import json
import keras
from keras import backend as K
from keras import optimizers
from keras.backend import tensorflow_backend
from keras.callbacks import LearningRateScheduler, TensorBoard
from math import ceil
import numpy as np
import os
import glob
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=100, help='number of epochs of training')
parser.add_argument('--batch_size', type=int, default=32, help='size of the batches')
parser.add_argument('--data', type=str, default=os.environ.get('SM_CHANNEL_DATA'))
# options referenced in main() below
parser.add_argument('--mode', type=str, default='dense', help="backbone: 'dense' or 'res'")
parser.add_argument('--ckpt_path', type=str, default='')
parser.add_argument('--min_data_ref', type=int, default=0)
parser.add_argument('--xlabel_column', type=str)
parser.add_argument('--brand_column', type=str)
parser.add_argument('--model_column', type=str)
parser.add_argument('--ref_column', type=str)
parser.add_argument('--encoding', type=str, default='utf-8')

opt = parser.parse_args()

def main():

    csv_config_dict = {
        'csv': os.path.join(opt.data, 'train.csv'),
        'image_dir': os.path.join(opt.data, 'images'),
        'xlabel_column': opt.xlabel_column,
        'brand_column': opt.brand_column,
        'model_column': opt.model_column,
        'ref_column': opt.ref_column,
        'encoding': opt.encoding
    }

    dataset = WatchDataSet(
        csv_config_dict=csv_config_dict,
        min_data_ref=opt.min_data_ref
    )

    X, y_c1, y_c2, y_fine = dataset.X, dataset.y_c1, dataset.y_c2, dataset.y_fine
    brand_uniq, model_uniq, ref_uniq = dataset.brand_uniq, dataset.model_uniq, dataset.ref_uniq

    print("ref. shape: ", y_fine.shape)
    print("brand shape: ", y_c1.shape)
    print("model shape: ", y_c2.shape)

    height, width = 224, 224
    channel = 3

    # get pre-trained weights
    if opt.mode == 'dense':
        WEIGHTS_PATH = 'https://github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels.h5'
    elif opt.mode == 'res':
        WEIGHTS_PATH = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
    weights_path, current_epoch, checkpoint = restore_checkpoint(opt.ckpt_path, WEIGHTS_PATH)

    # split train/validation
    y_ref_list = np.array([ref_uniq[np.argmax(i)] for i in y_fine])
    index_list = np.array(range(len(X)))
    train_index, test_index, _, _ = train_test_split(index_list, y_ref_list, train_size=0.8, random_state=23, stratify=None)

    print("Train")
    model = None
    bcnn = BCNN(
        height=height,
        width=width,
        channel=channel,
        num_classes=len(ref_uniq),
        coarse1_classes=len(brand_uniq),
        coarse2_classes=len(model_uniq),
        mode=opt.mode
    )

if __name__ == '__main__':
    main()

Upvotes: 3

Views: 1596

Answers (2)

Daniel Wyatt

Reputation: 1151

This isn't exactly what the questioner asked, but if anyone has come here wanting to know how to use custom libraries with the SKLearn estimator, you can pass them via the dependencies argument, like in the following:

import sagemaker
from sagemaker.sklearn.estimator import SKLearn

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
model = SKLearn(
                entry_point='training.py',
                role=role,
                instance_type='ml.m5.large',
                sagemaker_session=sess,
                dependencies=['my_custom_file.py']
               )
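
Note that dependencies takes a list of paths to files or directories; a directory passed this way is copied into the training container alongside the entry point script, so a whole local package (e.g. dependencies=['my_package/']) works as well.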

Upvotes: 1

Olivier Cruchant

Reputation: 4037

If you don't mind switching from TF 1.14 to TF 1.15.2 or later, you'll be able to bring a local code directory containing your custom modules to your SageMaker TensorFlow estimator via the source_dir argument. Your entry point script must live inside that source_dir. Details are in the SageMaker TensorFlow docs: https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#use-third-party-libraries
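
For example, with the working directory shown in the question, a minimal sketch (assuming the same SDK v1 estimator arguments the questioner used, just with the framework version bumped) would be:

from sagemaker.tensorflow import TensorFlow

# source_dir is uploaded to the training container in full, so sibling
# modules such as dataset.py, models.py, and utils.py become importable
# from train.py with a plain `from dataset import WatchDataSet`.
estimator = TensorFlow(entry_point='train.py',   # path relative to source_dir
                       source_dir='.',           # the WORKDIR from the question
                       role=role,
                       train_instance_count=1,
                       train_instance_type='ml.p2.xlarge',
                       framework_version='1.15.2',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters={'epochs': 10})

estimator.fit({'data': data_location})

If the source_dir also contains a requirements.txt, the listed packages are installed in the container before training starts.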

Upvotes: 2
