Reputation: 11928
This is a hard situation to describe.
I have a python model train script at:
myproject/opt/program/train
This gets a file at ./opt/ml/input/data/external/train.csv
When I do python3 opt/program/train
the training runs fine locally.
Then I containerize the project, copying opt to /opt in my Dockerfile.
Now when I run docker run <image name> train
it also trains fine.
Then I deploy the image to SageMaker, create an estimator, and call model.fit(my_data)
I get:
Exception during training: [Errno 2] File b'./opt/ml/input/data/external/train.csv' does not exist
It's definitely there: I was able to train by running the container myself, and when I run the container and explore the filesystem I can find the file.
So I think I have some filesystem misunderstanding. From the root of the container, all of these seem to have equivalent outputs.
root@798ffe7364c6:/# ls opt
ml program
root@798ffe7364c6:/# ls /opt
ml program
root@798ffe7364c6:/# ls ./opt
ml program
I'm trying to come up with a way to have one path that will work locally, in the container, and on AWS.
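One way to sketch this, assuming the directory layout above: try the absolute container path first and fall back to the relative path used for local runs. The helper name here is mine, not part of any SDK:

```python
import os

def find_train_csv():
    # Hypothetical helper: prefer the absolute path that exists inside the
    # container (and on SageMaker), then fall back to the relative path that
    # works when running locally from the project root.
    candidates = [
        "/opt/ml/input/data/external/train.csv",   # inside the container
        "./opt/ml/input/data/external/train.csv",  # local run from project root
    ]
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(
        "train.csv not found in any of: " + ", ".join(candidates)
    )
```

Since /opt in the container is copied from the project's opt directory, both candidates point at the same file when run from the container root, so this resolves cleanly in either environment.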
Upvotes: 0
Views: 3448
Reputation: 11928
I was missing the fact that SageMaker looks for your data channels in S3 and copies those to your container at /opt/ml/input/data
By default it seems to use training and validation as the channel names. Therefore, in my example above, it would never have copied data from my external folder on S3 to the right external folder in my container. In fact, I discovered it was copying it instead to /opt/ml/input/data/training/external/train.csv.
To resolve this, I would have had to either change my folder names or use InputDataConfig to define other channels. I chose the latter and was able to get it working.
More info on InputDataConfig here: https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html
Upvotes: 3