Hack-R

Reputation: 23200

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.

Images are stored in an s3 bucket with their class labels in their file names currently, e.g.

My-s3-bucket-dir

cat-1.jpg
dog-1.jpg
cat-2.jpg
..

I've been trying to adapt several related example .py scripts, but most seem to download data sets that are already in .rec format or that contain special manifest or annotation files I don't have.

All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file.

When I try to create the .lst file manually, the algorithm doesn't seem to accept it, and the manual work takes too long to be good practice anyway.

How can I automatically generate the .lst file (or otherwise send the images/classes for training)?

Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is

Image-classification-fulltraining-highlevel.ipynb

but it seems to download the data as .rec,

download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')

which just skips working with the .jpeg files. I found another that converts them to .rec but again it has essentially the .lst already as .json and just converts it.

I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.

How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?

Update

It looks like im2rec.py can't be run against s3. You'd have to completely download everything from all s3 buckets into the notebook's storage...

Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team

Upvotes: 2

Views: 1117

Answers (1)

Olivier Cruchant

Reputation: 4037

There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file ("augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.

The augmented manifest and .lst file options are quick to set up, since they only require you to create an annotation file, which usually takes no more than a short for loop. RecordIO requires the im2rec.py tool, which is a little more work.
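For the augmented manifest option, a minimal sketch could look like the following. It assumes the label is the part of each file name before the last "-" (as in cat-1.jpg), and it uses a hypothetical bucket name `my-bucket` and a label attribute called `class`; whatever attribute name you pick has to match what you configure on the training channel, so check the documentation linked above for the exact field and value format before relying on it.

```python
import json
import os


def build_manifest_lines(bucket, keys, label_attribute="class"):
    """Return one JSON line per image: its S3 URI plus a numeric class label."""
    # Assumption: the label is the part of the file name before the last '-'.
    labels = [os.path.basename(k).rsplit("-", 1)[0] for k in keys]
    # Map each distinct label to a successive id starting at 0.
    class_ids = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [
        json.dumps({"source-ref": f"s3://{bucket}/{key}",
                    label_attribute: str(class_ids[label])})
        for key, label in zip(keys, labels)
    ]


if __name__ == "__main__":
    # "my-bucket" is a placeholder, not a real bucket.
    for line in build_manifest_lines("my-bucket", ["cat-1.jpg", "dog-1.jpg", "cat-2.jpg"]):
        print(line)
```

Each returned string is one line of the manifest file, so writing them out joined by newlines gives a file you can upload next to the images.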

Using .lst files is another option that is reasonably easy: you just need to create the annotation file with a quick for loop, like this:

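# Assumes train_index, train_class and train_pics hold each picture's index,
# numeric class label and relative path, in matching order.
# 'w' overwrites the file, so re-running does not append duplicate rows.
with open('train.lst', 'w') as lst_file:
    for index, cl, pic in zip(train_index, train_class, train_pics):
        # One tab-separated row per image: index, class, path
        lst_file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')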
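Since the labels in the question live in the file names, the three lists can be derived from the S3 object keys alone, without downloading any images. The sketch below is one way to do that, not an official recipe: `build_lst_lines` and `list_s3_keys` are hypothetical helper names, the label is assumed to be everything before the last "-" in the file name, and the paths written to the .lst are assumed to be relative to the S3 prefix you pass as the image channel (hence the optional `strip_prefix`).

```python
import os


def build_lst_lines(keys, strip_prefix="", extensions=(".jpg", ".jpeg", ".png")):
    """Turn keys like 'cat-1.jpg' into .lst rows: index, class id, relative path."""
    images = sorted(k for k in keys if k.lower().endswith(extensions))
    # Assumption: the label is the part of the file name before the last '-'.
    labels = [os.path.basename(k).rsplit("-", 1)[0] for k in images]
    # Class ids are numbered successively from 0, in alphabetical label order.
    class_ids = {name: i for i, name in enumerate(sorted(set(labels)))}
    lines = []
    for index, (key, label) in enumerate(zip(images, labels)):
        path = key[len(strip_prefix):] if key.startswith(strip_prefix) else key
        lines.append(f"{index}\t{class_ids[label]}\t{path}")
    return lines, class_ids


def list_s3_keys(bucket, prefix=""):
    """List object keys under a prefix; only metadata is fetched, not image data."""
    import boto3  # imported lazily so build_lst_lines works without AWS access

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


if __name__ == "__main__":
    # Local demo with the file names from the question; no AWS call is made.
    rows, mapping = build_lst_lines(["cat-1.jpg", "dog-1.jpg", "cat-2.jpg"])
    print("\n".join(rows))
    print(mapping)
```

In a notebook you would call `list_s3_keys` with your own bucket and prefix, feed the result to `build_lst_lines`, write the rows to train.lst, and upload that file to the location you use for the list-file channel. Keep the returned mapping so you can translate predicted class ids back to names.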

Upvotes: 2
