Reputation: 2785
I created a training job yesterday, same as usual, just adding a bit more training data. I haven't had any problem with this in the last 2 years (the exact same procedure and code). This time, after roughly 14 hours, it simply stalled. The training job is still "in processing", but CloudWatch has not logged anything since then. Another 8 hours have now passed and there are no new entries in the logs: no errors, no crash. Can someone explain this? Unfortunately I don't have an AWS support plan. As you can see from the picture below, after 11am there is nothing.
The training job is supposed to complete in the next couple of hours, but now I'm not sure whether it is actually running (in which case this would be a CloudWatch problem) or not.
UPDATE
Suddenly the training job failed, without any further log output. The failure reason is:
ClientError: Artifact upload failed:Error 7: The credentials received have been expired
But there is still nothing in the logs after 11am. Very weird.
Upvotes: 1
Views: 2303
Reputation: 2785
For future readers: I can confirm that this is something that can happen very rarely (I haven't experienced it again since then), but it's AWS's fault. Same data, same algorithm.
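If you end up in the same spot and want to see what SageMaker itself reports about the job, independently of CloudWatch, something like the sketch below may help. It is not what I ran at the time. It assumes `client` behaves like `boto3.client("sagemaker")`, and the helper name `summarize_training_job` and the job name in the usage comment are made up for illustration. Note that `LastModifiedTime` reflects the last status change of the job, not log activity, so a large value is only a hint and does not prove the job is hung.

```python
from datetime import datetime, timezone


def summarize_training_job(client, job_name, now=None):
    """Collect the status fields SageMaker reports for a training job.

    `client` is assumed to behave like boto3.client("sagemaker").
    `now` can be passed explicitly (timezone-aware) to make the result reproducible.
    """
    now = now or datetime.now(timezone.utc)
    desc = client.describe_training_job(TrainingJobName=job_name)

    # The most recent secondary status transition, if any were reported.
    transitions = desc.get("SecondaryStatusTransitions", [])
    last = transitions[-1] if transitions else {}

    # Hours since the job's status was last modified (not since the last log line).
    last_modified = desc.get("LastModifiedTime")
    hours_idle = None
    if last_modified is not None:
        hours_idle = (now - last_modified).total_seconds() / 3600

    return {
        "status": desc["TrainingJobStatus"],
        "secondary_status": desc.get("SecondaryStatus"),
        "last_message": last.get("StatusMessage"),
        # Only present once the job has failed.
        "failure_reason": desc.get("FailureReason"),
        "hours_since_last_change": hours_idle,
    }


# Usage (requires boto3 and AWS credentials; the job name is hypothetical):
#   import boto3
#   print(summarize_training_job(boto3.client("sagemaker"), "my-training-job"))
```

In my case the `failure_reason` field would eventually have shown the `ClientError` quoted in the question, but only after the job had already failed.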
Upvotes: 1