Frankie

Reputation: 11928

SageMaker can't find paths in container

This is a hard situation to describe.

I have a python model train script at:

myproject/opt/program/train

The script reads its training data from ./opt/ml/input/data/external/train.csv

When I do python3 opt/program/train the training runs fine locally.

Then I containerize the project and copy opt to /opt in my Dockerfile.

Now when I run docker run <image name> train it also trains fine.

Then I deploy the image to SageMaker, create an estimator, and call model.fit(my_data). At that point I get:

Exception during training: [Errno 2] File b'./opt/ml/input/data/external/train.csv' does not exist

It's definitely there: I was able to train by running the container myself, and when I run the container and explore the filesystem I can find the file.

So I think I have some filesystem misunderstanding. From the root of the container, all of these seem to have equivalent outputs.

root@798ffe7364c6:/# ls opt
ml  program
root@798ffe7364c6:/# ls /opt
ml  program
root@798ffe7364c6:/# ls ./opt
ml  program

I'm trying to come up with a way to have one path that will work locally, in the container, and on AWS.

Upvotes: 0

Views: 3448

Answers (1)

Frankie

Reputation: 11928

I was missing the fact that SageMaker looks for your data channels in S3 and copies each one into your container under /opt/ml/input/data/<channel name>.
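
Since the data ends up under an absolute directory keyed by channel name, the train script can build its input path from the channel rather than from a relative ./opt path. The following is a minimal sketch, not a SageMaker API: channel_file and the LOCAL_DATA_ROOT environment variable are hypothetical names, used here only to let the same script point at a different folder when run outside the container.

```python
import os
from pathlib import Path

# Directory where SageMaker places input channels inside the container.
SAGEMAKER_DATA_ROOT = "/opt/ml/input/data"


def channel_file(channel, filename, root=None):
    """Return the path of `filename` inside the given input channel."""
    # LOCAL_DATA_ROOT is a hypothetical override for runs outside a container;
    # it is an assumption of this sketch, not something SageMaker sets.
    root = root or os.environ.get("LOCAL_DATA_ROOT", SAGEMAKER_DATA_ROOT)
    return Path(root) / channel / filename


if __name__ == "__main__":
    # Prints the resolved location of train.csv in the "external" channel.
    print(channel_file("external", "train.csv"))
```

Locally, setting LOCAL_DATA_ROOT to something like myproject/opt/ml/input/data would keep the same layout without changing the script.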

By default it seems to use training and validation as the channel names. Therefore, in my example above, it would never have copied the data from my external folder on S3 into /opt/ml/input/data/external in my container. In fact, I discovered it was copying it to /opt/ml/input/data/training/external/train.csv instead.
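
One way to see where the files actually landed is to log the contents of the input directory at the start of the train script. This is a small sketch using only the standard library; list_input_files is a hypothetical helper name, and the default root assumes the /opt/ml/input/data location described above.

```python
import os


def list_input_files(root="/opt/ml/input/data"):
    """Return sorted paths, relative to `root`, of every file under it."""
    found = []
    # os.walk yields nothing if `root` does not exist, so the result is [].
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


if __name__ == "__main__":
    # Print one relative path per line so it shows up in the training logs.
    for path in list_input_files():
        print(path)
```

The first path component of each entry is the channel directory the file was copied into.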

To resolve this, I would have had to either change my folder names or use InputDataConfig to define other channels. I chose the latter and was able to get it working.
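
As a rough illustration of defining a channel named external, the sketch below builds the InputDataConfig list in the shape the CreateTrainingJob API expects. The make_channel helper and the s3://my-bucket/external/ URI are placeholders invented for this example, not values from my setup, and the text/csv content type is an assumption.

```python
def make_channel(name, s3_uri, content_type="text/csv"):
    """Build one InputDataConfig entry for a channel backed by an S3 prefix."""
    return {
        # The channel name becomes the folder under /opt/ml/input/data.
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


# Placeholder bucket and prefix; replace with the real S3 location.
input_data_config = [make_channel("external", "s3://my-bucket/external/")]
```

This list would be passed as the InputDataConfig argument of a CreateTrainingJob request. With the SageMaker Python SDK, passing a dict keyed by channel name to fit, for example model.fit({"external": "s3://my-bucket/external/"}), should have the same effect.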

More info on InputDataConfig here: https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html

Upvotes: 3
