xtian
xtian

Reputation: 118

AWS Sagemaker - ClientError: Data download failed:Could not download

I encountered and error when I deploy my training job in my notebook instance. This what it says: "UnexpectedStatusException: Error for Training job tensorflow-training-2021-01-26-09-55-05-768: Failed. Reason: ClientError: Data download failed:Could not download s3://forex-model-data/data/train2001_2020.npz: insufficient disk space"

I deploy training job to try running it to different instances in 3 epoch. I use ml.c5.4xlarge, ml.c5.18xlarge, ml.m5.24xlarge, also I have two sets of training data, train2001_2020.npz and train2016_2020.npz.

First, I run train2001_2020 to ml.c5.18xlarge and ml.c5.18xlarge and the training job completed, then I switch to train2016_2020 and run it to ml.c5.4xlarge and ml.c5.18xlarge and it goes well. Then when I tried to run it using ml.m5.24xlarge I got an error (quoted above), but my dataset is train2016_2020 not train2001_2020 then when I rerun it again with all other instances it has the same error. What happen?

I stopped the instances and refresh everything, but I encountered same issue.

Upvotes: 0

Views: 1878

Answers (1)

rok
rok

Reputation: 2785

It's not really clear to all the test are you doing, but that error usually means that there is not enough disk space on the instance you are using for the training job. You can try to increase the additional storage for the instance (you can do in the estimator parameters if you are using the sagemaker SDK in a notebook).

Upvotes: 1

Related Questions