MAC

Reputation: 1523

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

I am running distributed training on the GCP Vertex AI platform. The model is trained in parallel on 4 GPUs using PyTorch and Hugging Face. After training, when I copy the saved model from the local container to a GCP bucket, the upload fails with the error above.

Here is the code:

I launch the train.py this way:

python -m torch.distributed.launch --nproc_per_node 4  train.py

After training is complete, I save the model files as shown below. There are 3 files that need to be saved.

trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0  cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP

Error:

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded

And sometimes I get this error:

ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.

Upvotes: 1

Views: 422

Answers (2)

blake

Reputation: 601

I encountered this issue as well. It appears to happen when the file contents change while gsutil rsync is uploading the file. This is more likely with large files, since file writes are not guaranteed to be transactional and the upload can start before the write has finished.
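In a setup like the one in the question, one plausible way for a file to change mid-upload is that all 4 launched processes run the upload step, some of them while the model files are still being written. This is an assumption about the cause, not something confirmed in the thread. A minimal sketch of a guard that lets only one process upload, assuming the launcher exports a `RANK` environment variable (the helper name `is_main_process` is hypothetical):

```python
import os


def is_main_process(environ=None):
    """Return True only for the process whose global rank is 0.

    Assumes the launcher sets RANK for each worker; if RANK is missing
    (for example, a single-process run), the process counts as main.
    """
    env = os.environ if environ is None else environ
    return int(env.get("RANK", "0")) == 0
```

The upload call would then be wrapped in `if is_main_process():` and placed after the save has returned, so the other workers never touch the bucket.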

I got around the issue by simply retrying the gsutil rsync command.
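The retry can also be automated instead of done by hand. The following is only a sketch: the helper name `run_with_retry`, the attempt count, and the delay are assumptions chosen for illustration, and it retries on any non-zero exit code rather than on this specific error.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay_seconds=5.0):
    """Run cmd (a list of arguments) until it exits with code 0.

    Returns the number of attempts used and raises RuntimeError
    if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        if attempt < attempts:
            # Wait before retrying so an in-progress write can finish.
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

It could be called as `run_with_retry(["gsutil", "rsync", "-r", "/pythonPackage/trainer/model_mlm", "gs://my-bucket/model_mlm"])`, where `my-bucket` is a placeholder for your own bucket name.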

Upvotes: 0

Sandeep Vokkareni

Reputation: 1675

As per the documentation on name conflicts, the 409 error means you are trying to overwrite an object that has already been created.

So I would recommend changing the destination to a location with a unique identifier per training run, so that uploads never collide. For example, append a timestamp in string format to the end of your bucket path, like:

- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
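A path like the one above could be built as follows. This is a sketch under assumptions: the helper name `unique_destination` and the bucket name in the usage note are hypothetical, and the timestamp format simply mirrors the example path.

```python
from datetime import datetime, timezone


def unique_destination(base_uri, now=None):
    """Append a YYYYMMDDHHMMSS timestamp (UTC by default) to a gs:// URI."""
    if now is None:
        now = datetime.now(timezone.utc)
    return f"{base_uri.rstrip('/')}/{now.strftime('%Y%m%d%H%M%S')}"
```

For example, `unique_destination("gs://my-bucket/model_mlm")` returns the base path followed by the current timestamp. In a multi-process run, compute the value once and share it, since each process would otherwise generate its own timestamp.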

I would also like to mention that this kind of error is retryable, as noted in the error documentation.

Upvotes: 0
