rok

Reputation: 2775

AWS Sagemaker failure after successful training "ClientError: Artifact upload failed:Insufficient disk space"

I'm training a network using a custom Docker image. The first training run with 50,000 steps went fine, but when I increased it to 80,000 steps I got the error "ClientError: Artifact upload failed:Insufficient disk space". I only increased the number of steps, so this is weird to me. There are no errors in the CloudWatch log; my last entry is:

Successfully generated graphs: ['pipeline.config', 'tflite_graph.pb', 'frozen_inference_graph.pb', 'tflite_graph.pbtxt', 'tflite_quant_graph.tflite', 'saved_model', 'hyperparameters.json', 'label_map.pbtxt', 'model.ckpt.data-00000-of-00001', 'model.ckpt.meta', 'model.ckpt.index', 'checkpoint']

Which basically means that those files have been created, because that log line is just the output of:

    graph_files = os.listdir(model_path + '/graph')

Which disk space is it talking about? Also, looking at the training job, I see from the disk utilization chart that the rising curve peaks at 80%... I expected that after the successful creation of the aforementioned files, everything would be uploaded to my S3 bucket, where no disk space issues exist. Why does 50,000 steps work while 80,000 doesn't? It's my understanding that the number of training steps doesn't influence the size of the model files.

Upvotes: 1

Views: 4081

Answers (2)

Hamza Liaqat

Reputation: 41

When the SageMaker training job completes, the model in the /opt/ml/model directory of the container is uploaded to S3. If the artifact to be uploaded is too large, the error ClientError: Artifact upload failed:... is thrown. Increasing the volume size will fix the problem superficially, but in most cases the model does not have to be that large, right?

Note that the odds are your model itself is not too large, but that you are also saving your checkpoints to /opt/ml/model (a bug). At the end of training, SageMaker tries to pack everything in that directory (the model and all checkpoints) into a single archive to upload to S3, and that archive no longer fits on the volume. Hence the error. You can confirm whether this is the cause by checking the size of the uploaded model.tar.gz file on S3.
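For reference, a minimal sketch of how that size check could look with boto3; the bucket and key below are placeholders for your own training job's output location:

    import boto3

    s3 = boto3.client("s3")
    # Placeholder bucket/key -- point these at <output path>/<job name>/output/model.tar.gz
    resp = s3.head_object(Bucket="my-training-bucket",
                          Key="my-training-job/output/model.tar.gz")
    size_gib = resp["ContentLength"] / 1024 ** 3
    print(f"model.tar.gz is {size_gib:.2f} GiB")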

Why does 50,000 steps work while 80,000 doesn't?

With 80,000 steps, the number of checkpoints has also increased, and the final model.tar.gz that is uploaded to S3 has become so big that it no longer fits on the current volume.
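One way to avoid this, sketched below under the assumption that your training script controls where it writes checkpoints: keep intermediate checkpoints outside /opt/ml/model (for example under /opt/ml/checkpoints) and copy only the final exported files into /opt/ml/model. The helper and file list are illustrative, taken from the graph files named in the question:

    import os
    import shutil

    # SageMaker packs everything under /opt/ml/model into model.tar.gz,
    # so write intermediate checkpoints somewhere else.
    CHECKPOINT_DIR = "/opt/ml/checkpoints"   # not included in the uploaded model artifact
    MODEL_DIR = "/opt/ml/model"              # only the final export goes here

    os.makedirs(CHECKPOINT_DIR, exist_ok=True)

    def export_final_artifacts(graph_dir):
        """Copy only the files that should end up in model.tar.gz (illustrative list)."""
        for name in ("pipeline.config", "frozen_inference_graph.pb",
                     "tflite_quant_graph.tflite", "label_map.pbtxt"):
            src = os.path.join(graph_dir, name)
            if os.path.exists(src):
                shutil.copy(src, MODEL_DIR)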

Upvotes: 1

rok

Reputation: 2775

Increasing the storage of the training job by setting "additional storage volume per instance (GB)" to 5 GB at creation seems to solve the problem. I still don't understand why, but the problem seems solved.
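For anyone creating the job from the SageMaker Python SDK instead of the console, the rough counterpart of that console setting is the Estimator's volume_size parameter (SDK v2 parameter names shown; the image URI, role, and S3 paths below are placeholders):

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest",
        role="arn:aws:iam::<account>:role/MySageMakerRole",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        volume_size=35,  # GB of EBS storage attached to the training instance
        output_path="s3://my-training-bucket/output",
    )
    estimator.fit("s3://my-training-bucket/input")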

Upvotes: 2
