rok

Reputation: 2785

AWS SageMaker training job stuck in progress state

I created a training job yesterday, the same as usual, just adding a bit more training data. I haven't had any problem with this in the last 2 years (exactly the same procedure and code). This time, after roughly 14 hours, it simply stalled. The training job is still "in progress", but CloudWatch has not logged anything since then. Another 8 hours have now passed with no new log entries, no errors, and no crash. Can someone explain this? Unfortunately, I don't have an AWS support plan. As you can see from the picture below, there is nothing after 11am.

[Image: CloudWatch logs for the training job, with no entries after 11am]

The training job was supposed to complete in the next couple of hours, but now I'm not sure whether it is actually still running (in which case this would be a CloudWatch problem) or not.
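One way to get a second opinion besides the CloudWatch console is to ask the SageMaker API what it reports for the job. The following is only a minimal sketch: `summarize_training_job` and `fetch_summary` are hypothetical helper names, and the summary is based on the standard fields of a `DescribeTrainingJob` response. It shows what the service reports, not whether the container is still doing useful work. In particular, `LastModifiedTime` tracks status changes rather than log activity.

```python
from datetime import datetime, timezone


def summarize_training_job(description, now=None):
    """Condense a DescribeTrainingJob response into a few diagnostic fields.

    `description` is the dict returned by describe_training_job.
    """
    now = now or datetime.now(timezone.utc)
    transitions = description.get("SecondaryStatusTransitions", [])
    last = transitions[-1] if transitions else {}

    # LastModifiedTime is a timezone-aware datetime in boto3 responses.
    last_modified = description.get("LastModifiedTime")
    idle_hours = None
    if last_modified is not None:
        idle_hours = (now - last_modified).total_seconds() / 3600

    return {
        "status": description.get("TrainingJobStatus"),
        "secondary_status": description.get("SecondaryStatus"),
        "last_transition_message": last.get("StatusMessage"),
        "hours_since_last_status_change": idle_hours,
        # Only present once the job has failed.
        "failure_reason": description.get("FailureReason"),
    }


def fetch_summary(job_name):
    """Query SageMaker for a job; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("sagemaker")
    response = client.describe_training_job(TrainingJobName=job_name)
    return summarize_training_job(response)
```

Calling `fetch_summary("my-training-job")` (the job name here is a placeholder) returns a small dict. A job that still shows `InProgress` with no status change for many hours is at least a hint that the problem is not limited to logging.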

UPDATE

The training job has now suddenly failed, without any further log output. The failure reason is:

ClientError: Artifact upload failed:Error 7: The credentials received have been expired

But there is still nothing in the logs after 11am. Very weird.

Upvotes: 1

Views: 2303

Answers (1)

rok

Reputation: 2785

For future readers: I can confirm that this can happen, though very rarely (I haven't experienced it again since then), and it is AWS's fault. Same data, same algorithm.

Upvotes: 1
