Reputation: 8427
I'm using a custom algorithm shipped in a Docker image, running on a p2 instance with AWS SageMaker (somewhat similar to https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).
At the end of the training process, I try to write my model to the output directory that SageMaker mounts (as in the tutorial), like this:
import os

model_path = "/opt/ml/model"
model.save(os.path.join(model_path, 'model.h5'))
Unfortunately, the model apparently grows too large over time and I get the following error:
RuntimeError: Problems closing file (file write failed: time = Thu Jul 26 00:24:48 2018
00:24:49 , filename = 'model.h5', file descriptor = 22, errno = 28, error message = 'No space left on device', buf = 0x1a41d7d0, total write[...]
So all my hours of GPU time are wasted. How can I prevent this from happening again? Does anyone know what the size limit is for models stored on SageMaker/mounted directories?
Upvotes: 3
Views: 4529
Reputation: 11996
When you train a model with Estimators, it defaults to 30 GB of storage, which may not be enough. You can use the train_volume_size param on the constructor to increase this value. Try a large-ish number (like 100 GB) and see how big your model actually is; in subsequent jobs, you can tune the value down to something closer to what you need.
Storage costs $0.14 per GB-month of provisioned storage. Partial usage is prorated, so giving yourself some extra room is a cheap insurance policy against running out of storage.
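For example, a minimal sketch using the train_volume_size parameter (this is the v1 SageMaker Python SDK naming; newer SDK versions call it volume_size). The image URI, role, instance type, and S3 paths below are placeholders for your own setup:

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algorithm:latest",  # your ECR image
    role="arn:aws:iam::123456789012:role/MySageMakerRole",                          # your execution role
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    train_volume_size=100,   # EBS volume size in GB (default is 30)
    output_path="s3://my-bucket/model-output",
)

estimator.fit("s3://my-bucket/training-data")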
Upvotes: 1
Reputation: 877
In the SageMaker Jupyter notebook, you can check free space on the filesystem(s) by running !df -h. For a specific path, try something like !df -h /opt.
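If you'd rather catch this from inside the training container itself, a small check before saving can warn you early instead of failing at the very end of training. This is only a sketch using Python's standard library; the 5 GB threshold is an arbitrary value for illustration:

import shutil

model_path = "/opt/ml/model"

# Free space (in bytes) on the volume backing /opt/ml/model.
total, used, free = shutil.disk_usage(model_path)
print("Free space: %.1f GB" % (free / 1e9))

if free < 5e9:  # arbitrary 5 GB threshold for illustration
    print("Warning: less than 5 GB free, model save may fail")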
Upvotes: 1