Reputation: 1794
I'm running a training job on SageMaker. The job doesn't fully complete and hits the MaxRuntimeInSeconds stopping condition. When the job is stopping, the documentation says the model artifact will still be saved. I've attached the status progression of my training job below. It looks like the training job finished correctly. However, the output S3 folder is empty. Any ideas on what is going wrong here? The training data is located in the same bucket, so the job should have everything it needs.
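For context, the job is launched roughly like the sketch below; the image, role, and bucket names are placeholders rather than my real values, and max_run is what sets MaxRuntimeInSeconds:

```python
import sagemaker
from sagemaker.estimator import Estimator

# Placeholder names; the real job uses my own image, role, and bucket.
session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",   # this prefix ends up empty
    max_run=3600,                           # maps to MaxRuntimeInSeconds
    sagemaker_session=session,
)

# Training data lives in the same bucket as the output prefix.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```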
Upvotes: 0
Views: 1756
Reputation: 6779
If MaxRuntimeInSeconds is exceeded, then the model upload is only best-effort and really depends on whether the algorithm saved any state to /opt/ml/model at all prior to being terminated.
The two-minute wait period between 15:33 and 15:35 in the Stopping stage signifies the maximum time between the SIGTERM and SIGKILL signals sent to your algorithm (see the SageMaker documentation for more detail). If your algorithm traps the SIGTERM, it is supposed to use that as a signal to gracefully save its work and shut down before the SageMaker platform kills it forcibly with a SIGKILL signal 2 minutes later.
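As a rough illustration of what trapping the SIGTERM could look like inside a training script (the save_model helper here is just a stand-in for however your algorithm actually serializes its state), something along these lines would give the upload step a non-empty /opt/ml/model to work with:

```python
import os
import signal
import sys

MODEL_DIR = "/opt/ml/model"  # SageMaker uploads whatever ends up in this directory

def save_model(model_dir):
    # Stand-in for your real serialization logic (e.g. torch.save, joblib.dump).
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "checkpoint.txt"), "w") as f:
        f.write("partial training state")

def handle_sigterm(signum, frame):
    # SageMaker sends SIGTERM when stopping the job and follows up with SIGKILL
    # roughly two minutes later, so save quickly and exit cleanly.
    save_model(MODEL_DIR)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# ... normal training loop continues here; periodic checkpoints are also wise ...
```

Checkpointing to /opt/ml/model periodically during training is an even safer approach, since there is then something to upload even if the SIGTERM handler never gets a chance to run.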
Given that the wait period in the Stopping step is exactly 2 minutes, and that the Uploading step started at 15:35 and completed almost immediately at 15:35, it's likely that your algorithm did not take advantage of the SIGTERM warning and that there was nothing saved to /opt/ml/model. To get a definitive answer as to whether this was indeed the case, please create a SageMaker forum post and the SageMaker team can private-message you to gather details about your job.
Upvotes: 0
Reputation: 1
From the status progression, it seems that the training image download completed at 15:33 UTC, and by that time the stopping condition had already been triggered based on the MaxRuntimeInSeconds parameter you specified. From then, it takes 2 minutes (15:33 to 15:35) to save any available model artifact, but in your case the training process did not happen at all; the only thing that was done was downloading the pre-built image (containing the ML algorithm). Please refer to the lines from the documentation quoted below, which say that saving the model is subject to the state the training process is in. Maybe you can try increasing MaxRuntimeInSeconds and running the job again. Also, please check the MaxWaitTimeInSeconds value if you have set one; it must be equal to or greater than MaxRuntimeInSeconds (a rough sketch of both parameters follows the documentation excerpt below).
Please find the excerpt from the AWS documentation:
"The training algorithms provided by Amazon SageMaker automatically save the intermediate results of a model training job when possible. This attempt to save artifacts is only a best effort case as model might not be in a state from which it can be saved. For example, if training has just started, the model might not be ready to save."
Upvotes: 0