Reputation: 103
I was trying to train the YOLOv5x6 model for 300 epochs on a Google Colab Pro+ instance. Unfortunately, after running for more than 20 hours, the training halted at the 250th epoch without showing any error, warning, or other message. Any idea what went wrong? Before trying again, I'd like to know what could have caused this. Is there a way to continue the training from where it left off?
GPU: Tesla P100-PCIE-16GB, 16280.875MB
Runtime shape: Standard
Upvotes: 1
Views: 705
Reputation: 648
Google Colab Pro+ still has a 24-hour limit on the total runtime of a VM, so a run that needs longer than that is cut off when the VM is recycled.
One approach you can try is to save the state of your training every X iterations and upload it to Google Drive or another cloud service (or download it to your local machine).
Then, you restart the notebook and load the last saved state of the training instead of starting from scratch.
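The save-and-resume idea can be sketched in a framework-independent way. The snippet below is only an illustration, not YOLOv5 code: `save_checkpoint`, `load_checkpoint`, `train`, and the `stop_after` parameter (which simulates the VM being cut off) are hypothetical names, and the "weights" are a placeholder number. In a real notebook the checkpoint path would point at a mounted Google Drive folder, which is an assumption about your setup.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temp file, then rename it over the target,
    so an interrupted save does not leave a half-written checkpoint."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_epochs, ckpt_path, save_every=10, stop_after=None):
    """Toy training loop that resumes from the last checkpoint if present.

    stop_after is hypothetical: it simulates the runtime being terminated
    once that many epochs have completed.
    """
    state = load_checkpoint(ckpt_path) or {"epoch": 0, "weights": 0.0}
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulated disconnect; unsaved progress is lost
        state["weights"] += 1.0  # placeholder for one real training epoch
        state["epoch"] = epoch + 1
        if state["epoch"] % save_every == 0 or state["epoch"] == total_epochs:
            save_checkpoint(state, ckpt_path)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "last.pkl")
    train(300, ckpt, save_every=10, stop_after=255)  # session dies mid-run
    resumed = train(300, ckpt, save_every=10)        # new session picks up
    print(resumed["epoch"])
```

With `save_every=10`, at most ten epochs of work are lost when the session ends. For YOLOv5 specifically, `train.py` writes its own `last.pt` checkpoint and has a `--resume` option, so it may be enough to keep the run directory on Drive and resume from that file; check the repository's documentation for the exact usage in your version.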
Upvotes: 2