Reputation: 23214
Problem:
Jobs repeatedly fail after 5 minutes with the error
ClientError: .lst file missing in the train_lst channel.
Context:
Working within the AWS console, I have a binary image classification task. I have labeled the classes in the filenames, per a guide.
Eventually I started hitting errors that revealed that, for this particular algorithm, a .lst file is required for gathering the labels, since "Content Type" is specified as image, which apparently requires a .lst file.
Example Data:
I am trying to match the examples I see on StackOverflow and elsewhere online. The current iteration of trn_list.lst looks like this:
292 \t 1 \t dog-292.jpeg
214 \t 1 \t dog-214.jpeg
290 \t 0 \t cat-290.jpeg
288 \t 1 \t dog-288.jpeg
160 \t 1 \t dog-160.jpeg
18 \t 0 \t cat-18.jpeg
215 \t 1 \t dog-215.jpeg
254 \t 1 \t dog-254.jpeg
53 \t 1 \t dog-53.jpeg
337 \t 0 \t cat-337.jpeg
284 \t 0 \t cat-284.jpeg
177 \t 1 \t dog-177.jpeg
192 \t 1 \t dog-192.jpeg
228 \t 0 \t cat-228.jpeg
305 \t 0 \t cat-305.jpeg
258 \t 1 \t dog-258.jpeg
75 \t 0 \t cat-75.jpeg
148 \t 0 \t cat-148.jpeg
268 \t 1 \t dog-268.jpeg
281 \t 1 \t dog-281.jpeg
24 \t 1 \t dog-24.jpeg
328 \t 1 \t dog-328.jpeg
99 \t 1 \t dog-99.jpeg
The bucket has no sub-folders, so I just put the .lst on the
In one iteration I allowed my R program that creates the .lst to replace the \t with actual tabs when it writes the file out. In other iterations I left the literal delimiters (\t) in there. It didn't seem to affect the result (?).
Upvotes: 0
Views: 800
Reputation: 12891
When you use SageMaker training jobs, you are actually deploying a Docker image to a cluster of EC2 instances. The Docker image contains a Python file that runs the training code, much as you would run it on your own machine. The training code refers to local folders where it expects to find the data, such as the images to train on and the metadata to use for that training.
The "magic" is how the data in S3 becomes available locally on the training instances. This is done through the channel definitions in your training job configuration. Each channel definition creates a local folder on the training instance and copies the data from S3 to that local folder. The channel names, S3 locations, and file formats must match what the algorithm expects.
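As a rough illustration of that mapping: in File input mode, SageMaker training containers expose each channel under /opt/ml/input/data/<channel_name>. The helper name below is hypothetical and only shows how a channel name turns into a local folder.

```python
from pathlib import PurePosixPath

# Base directory under which SageMaker training containers expose input channels.
INPUT_DATA_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name: str) -> PurePosixPath:
    """Return the local folder for a channel (hypothetical helper, for illustration)."""
    return INPUT_DATA_ROOT / channel_name


# The four channel names used by the image classification algorithm in Image format.
for name in ("train", "validation", "train_lst", "validation_lst"):
    print(f"{name:15s} -> {channel_dir(name)}")
```

If a channel such as train_lst is not defined in the job configuration, no matching folder is populated, so the algorithm has nowhere to find the .lst file.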
Here is the documentation of the definition of a channel in SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/API_Channel.html
For the specific example of the built-in algorithm for image classification, if you use the Image format for training, specify train, validation, train_lst, and validation_lst channels as values for the InputDataConfig parameter of the CreateTrainingJob request. Specify the individual image data (.jpg or .png files) for the train and validation channels. Specify one .lst file in each of the train_lst and validation_lst channels. Set the content type for all four channels to application/x-image.
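The four channels described above could be written out as an InputDataConfig value roughly like this. It is a minimal sketch: the bucket name, the prefixes, and the image_channel helper are assumptions for illustration, and using one prefix per channel is just one possible layout.

```python
def image_channel(name: str, s3_uri: str) -> dict:
    """Build one Channel entry for InputDataConfig (hypothetical helper)."""
    return {
        "ChannelName": name,
        "ContentType": "application/x-image",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical bucket and prefixes: replace with your own S3 locations.
BUCKET = "s3://my-example-bucket"

input_data_config = [
    image_channel("train", f"{BUCKET}/train/"),                    # training images
    image_channel("validation", f"{BUCKET}/validation/"),          # validation images
    image_channel("train_lst", f"{BUCKET}/train_lst/"),            # one .lst file
    image_channel("validation_lst", f"{BUCKET}/validation_lst/"),  # one .lst file
]

for channel in input_data_config:
    print(channel["ChannelName"], channel["DataSource"]["S3DataSource"]["S3Uri"])
```

A list like this is what the InputDataConfig parameter of CreateTrainingJob expects. In the console, the same information is entered as one input data channel per entry.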
See more details here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-inputoutput
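The page linked above describes a .lst file as tab-separated with three columns: image index, class label index, and relative path of the image. Here is a minimal sketch of writing such a file with real tab characters, using a few rows from the question. The write_lst helper is hypothetical, and the sketch does not by itself address the missing-channel error.

```python
import csv
import io


def write_lst(rows, out):
    """Write (index, label, relative_path) rows as tab-separated lines."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for index, label, path in rows:
        writer.writerow([index, label, path])


# A few rows taken from the question's sample data.
rows = [
    (292, 1, "dog-292.jpeg"),
    (214, 1, "dog-214.jpeg"),
    (290, 0, "cat-290.jpeg"),
]

# Write to an in-memory buffer here; pass an open file instead to create a real .lst.
buf = io.StringIO()
write_lst(rows, buf)
lst_text = buf.getvalue()
print(repr(lst_text))
```

Each line ends up with exactly two tab characters and no surrounding spaces, unlike a literal backslash-t sequence written as text.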
Upvotes: 3