Reputation: 23200
I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are stored in an s3 bucket with their class labels in their file names currently, e.g.
My-s3-bucket-dir
cat-1.jpg
dog-1.jpg
cat-2.jpg
..
I've been trying to leverage several related example .py scripts, but most seem to be download data sets already in .rec format or containing special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst
file
When I try to manually create the .lst
it doesn't seem to like it and it also takes too long doing manual work to be a good practice.
How can I automatically generate the .lst
file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py
was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec
,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another that converts them to .rec
but again it has essentially the .lst
already as .json
and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst
or otherwise get the data/class info into SageMaker without manually creating a .lst
file?
Update
It looks like im2py can't be run against s3. You'd have to completely download everything from all s3 buckets into the notebook's storage...
Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team
Upvotes: 2
Views: 1117
Reputation: 4037
There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file ("augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
Augmented Manifest and .lst files option are quick to do since they just require you to create an annotation file with a usually quick for
loop for example. RecordIO requires you to use im2rec.py
tool, which is a little more work.
Using .lst files is another option that is reasonably easy: you just need to create annotation them with a quick for loop, like this:
# assuming train_index, train_class, train_pics store the pic index, class and path
with open('train.lst', 'a') as file:
for index, cl, pic in zip(train_index, train_class, train_pics):
file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')
Upvotes: 2