TheMP

Reputation: 8427

No space left on device in Sagemaker model training

I'm using a custom algorithm shipped in a Docker image, running on a p2 instance with AWS SageMaker (somewhat similar to https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)

At the end of the training process, I try to write my model to the output directory that SageMaker mounts (as in the tutorial), like this:

import os

# /opt/ml/model is the directory SageMaker uploads after training
model_path = "/opt/ml/model"
model.save(os.path.join(model_path, 'model.h5'))

Unfortunately, the model apparently grows too big over time, and I get the following error:

RuntimeError: Problems closing file (file write failed: time = Thu Jul 26 00:24:48 2018

00:24:49 , filename = 'model.h5', file descriptor = 22, errno = 28, error message = 'No space left on device', buf = 0x1a41d7d0, total write[...]

So all my hours of GPU time are wasted. How can I prevent this from happening again? Does anyone know the size limit for models stored in SageMaker-mounted directories?

Upvotes: 3

Views: 4529

Answers (2)

Trenton

Reputation: 11996

When you train a model with an Estimator, the training instance defaults to 30 GB of attached storage, which may not be enough. You can pass the train_volume_size param to the constructor to increase this value. Try a large-ish number (like 100 GB) and see how big your model actually is; in subsequent jobs you can tune the value down to something closer to what you need.

Storage costs $0.14 per GB-month of provisioned storage. Partial usage is prorated, so giving yourself some extra room is a cheap insurance policy against running out of storage.
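For example, a sketch using the SageMaker Python SDK (v1-era parameter names, matching train_volume_size above; the image URI, IAM role, and S3 paths are placeholders you would replace with your own):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/MySageMakerRole",                     # placeholder
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    train_volume_size=100,  # GB of EBS storage attached to the training instance
)
estimator.fit("s3://my-bucket/training-data")  # placeholder S3 input
```

Note that in v2 of the SageMaker Python SDK these parameters were renamed (train_volume_size became volume_size, train_instance_type became instance_type, image_name became image_uri).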

Upvotes: 1

Lakshay Sharma

Reputation: 877

In the SageMaker Jupyter notebook, you can check free space on the filesystem(s) by running !df -h. For a specific path, try something like !df -h /opt.
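You can do the same check from Python with the standard library, which also works inside the training container itself (the path is illustrative; inside a SageMaker training job you would point it at /opt/ml):

```python
import shutil

# Report free space on the volume backing a given path; "/" is used
# here for illustration — pass "/opt/ml" inside the training container.
usage = shutil.disk_usage("/")
free_gb = usage.free / (1024 ** 3)
total_gb = usage.total / (1024 ** 3)
print(f"free: {free_gb:.1f} GiB of {total_gb:.1f} GiB")
```

Logging this before the final model.save() call makes it obvious in the CloudWatch logs whether the volume is about to fill up.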

Upvotes: 1
