Reputation: 109
I am trying to use SageMaker script mode for training a model on image data. I have multiple scripts for data preparation, model creation, and training. This is the content of my working directory:
WORKDIR
|-- config
| |-- hyperparameters.json
| |-- lossweights.json
| `-- lr.json
|-- dataset.py
|-- densenet.py
|-- resnet.py
|-- models.py
|-- train.py
|-- imagenet_utils.py
|-- keras_utils.py
|-- utils.py
`-- train.ipynb
The training script is train.py, and it makes use of the other scripts. To run the training script, I'm using the following code:
bucket='ashutosh-sagemaker'
data_key = 'training'
data_location = 's3://{}/{}'.format(bucket, data_key)
print(data_location)
inputs = {'data':data_location}
print(inputs)
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(entry_point='train.py',
                       role=role,
                       train_instance_count=1,
                       train_instance_type='ml.p2.xlarge',
                       framework_version='1.14',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters={
                           'epochs': 10
                       }
                       )
estimator.fit(inputs)
On running this code, I get the following output:
2020-11-09 10:42:07 Starting - Starting the training job...
2020-11-09 10:42:10 Starting - Launching requested ML instances......
2020-11-09 10:43:24 Starting - Preparing the instances for training.........
2020-11-09 10:44:43 Downloading - Downloading input data....................................
2020-11-09 10:51:08 Training - Downloading the training image...
2020-11-09 10:51:40 Uploading - Uploading generated training model
Traceback (most recent call last):
File "train.py", line 5, in <module>
from dataset import WatchDataSet
ModuleNotFoundError: No module named 'dataset'
WARNING: Logging before flag parsing goes to stderr.
E1109 10:51:37.525632 140519531874048 _trainer.py:94] ExecuteUserScriptError:
Command "/usr/local/bin/python3.6 train.py --epochs 10 --model_dir s3://sagemaker-ap-northeast-1-485707876195/tensorflow-training-2020-11-09-10-42-06-234/model"
2020-11-09 10:51:47 Failed - Training job failed
What should I do to remove the ModuleNotFoundError? I looked for solutions but didn't find any relevant resources.
The contents of the train.py file:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from dataset import WatchDataSet
from models import BCNN
from utils import image_generator, val_image_generator
from utils import BCNNScheduler, LossWeightsModifier
from utils import restore_checkpoint, get_epoch_key
import argparse
from collections import defaultdict
import json
import keras
from keras import backend as K
from keras import optimizers
from keras.backend import tensorflow_backend
from keras.callbacks import LearningRateScheduler, TensorBoard
from math import ceil
import numpy as np
import os
import glob
from sklearn.model_selection import train_test_split
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=100, help='number of epoch of training')
parser.add_argument('--batch_size', type=int, default=32, help='size of the batches')
parser.add_argument('--data', type=str, default=os.environ.get('SM_CHANNEL_DATA'))
opt = parser.parse_args()
def main():
    csv_config_dict = {
        'csv': opt.data + 'train.csv',
        'image_dir': opt.data + 'images',
        'xlabel_column': opt.xlabel_column,
        'brand_column': opt.brand_column,
        'model_column': opt.model_column,
        'ref_column': opt.ref_column,
        'encording': opt.encoding
    }
    dataset = WatchDataSet(
        csv_config_dict=csv_config_dict,
        min_data_ref=opt.min_data_ref
    )
    X, y_c1, y_c2, y_fine = dataset.X, dataset.y_c1, dataset.y_c2, dataset.y_fine
    brand_uniq, model_uniq, ref_uniq = dataset.brand_uniq, dataset.model_uniq, dataset.ref_uniq
    print("ref. shape: ", y_fine.shape)
    print("brand shape: ", y_c1.shape)
    print("model shape: ", y_c2.shape)
    height, width = 224, 224
    channel = 3
    # get pre-trained weights
    if opt.mode == 'dense':
        WEIGHTS_PATH = 'https://github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels.h5'
    elif opt.mode == 'res':
        WEIGHTS_PATH = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
    weights_path, current_epoch, checkpoint = restore_checkpoint(opt.ckpt_path, WEIGHTS_PATH)
    # split train/validation
    y_ref_list = np.array([ref_uniq[np.argmax(i)] for i in y_fine])
    index_list = np.array(range(len(X)))
    train_index, test_index, _, _ = train_test_split(index_list, y_ref_list, train_size=0.8, random_state=23, stratify=None)
    print("Train")
    model = None
    bcnn = BCNN(
        height=height,
        width=width,
        channel=channel,
        num_classes=len(ref_uniq),
        coarse1_classes=len(brand_uniq),
        coarse2_classes=len(model_uniq),
        mode=opt.mode
    )
if __name__ == '__main__':
    main()
Upvotes: 3
Views: 1596
Reputation: 1151
This isn't exactly what the question asked, but if anyone has come here wanting to know how to use custom libraries with SKLearn, you can pass them through the dependencies argument, as in the following:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
model = SKLearn(
    entry_point='training.py',
    role=role,
    instance_type='ml.m5.large',
    sagemaker_session=sess,
    dependencies=['my_custom_file.py']
)
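For a flat working directory like the one in the question, the list passed to dependencies could be gathered rather than typed out. The following is only a sketch: collect_dependencies is a hypothetical helper (not part of the SageMaker SDK), and it assumes that individual .py files are accepted in dependencies, as in the snippet above.

```python
from pathlib import Path


def collect_dependencies(workdir, entry_point):
    """Return the .py files directly inside workdir, excluding the entry point.

    Hypothetical helper for illustration; the result is meant to be passed
    as the `dependencies` argument of an estimator.
    """
    root = Path(workdir)
    return sorted(
        str(path)
        for path in root.glob("*.py")   # top-level modules only, no recursion
        if path.name != entry_point     # the entry point is uploaded separately
    )
```

It could then be used as dependencies=collect_dependencies('.', 'training.py') when constructing the estimator.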
Upvotes: 1
Reputation: 4037
If you don't mind switching from TF 1.14 to TF 1.15.2+, you can bring a local code directory containing your custom modules to your SageMaker TensorFlow Estimator via the source_dir argument. Your entry point script must be inside that source_dir. Details are in the SageMaker TensorFlow documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#use-third-party-libraries
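Applied to the directory in the question, the source_dir approach might look roughly like the sketch below. It reuses the question's SDK v1 argument names (train_instance_count, train_instance_type, script_mode); in SDK v2 these became instance_count and instance_type, and script_mode was dropped. The two helper functions and the choice of '.' as source_dir are assumptions made for illustration, not something taken from the SageMaker docs.

```python
def tensorflow_estimator_kwargs(role, source_dir="."):
    """Build keyword arguments for sagemaker.tensorflow.TensorFlow (SDK v1 names)."""
    return {
        "entry_point": "train.py",        # path relative to source_dir
        "source_dir": source_dir,         # whole directory is uploaded, so dataset.py,
                                          # models.py and utils.py travel with train.py
        "role": role,
        "train_instance_count": 1,
        "train_instance_type": "ml.p2.xlarge",
        "framework_version": "1.15.2",    # version suggested in the answer above
        "py_version": "py3",
        "script_mode": True,
        "hyperparameters": {"epochs": 10},
    }


def build_estimator(role, source_dir="."):
    """Create the estimator; requires the sagemaker SDK and AWS credentials."""
    # Imported lazily so the kwargs helper can be inspected without the SDK installed.
    from sagemaker.tensorflow import TensorFlow
    return TensorFlow(**tensorflow_estimator_kwargs(role, source_dir))
```

With role and inputs defined as in the question, training would then be started with build_estimator(role).fit(inputs).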
Upvotes: 2