Reputation: 11374
There's a similar SO question: Tensorflow on ML Engine: The replica master 0 exited with a non-zero status of 1
But here, I'm getting exit status 127 instead. As in that question, I launched a PyTorch custom training container on AI Platform (previously ML Engine), and after about 2 minutes I got the error message "The replica master 0 exited with a non-zero status of 127".
The documentation here doesn't quite say what "127" means: https://cloud.google.com/ai-platform/training/docs/troubleshooting#understanding_training_application_return_codes
Does anyone have an idea?
Upvotes: 3
Views: 2906
Reputation: 1
Make sure the architecture of your Docker image is compatible with the Vertex AI worker nodes, which use the amd64 architecture. For example, an image built on an Apple silicon Mac (such as an M2) is arm64 by default and will not run there. Build it for amd64 instead:
docker buildx create --use
docker buildx build --platform linux/amd64 -t "name of your image on google cloud" .
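To confirm what you actually built, you can read the architecture recorded in the image metadata. This is only a rough sketch: `is_amd64_image_arch` is a hypothetical helper name, `IMAGE` is a placeholder for your own tag, and the commented usage assumes Docker is installed and the image is available locally.

```shell
# Hypothetical helper: succeeds only when the given architecture string is
# "amd64", the value `docker image inspect` reports for x86-64 images.
is_amd64_image_arch() {
  [ "$1" = "amd64" ]
}

# Usage (needs Docker and a local image; IMAGE is a placeholder):
#   arch=$(docker image inspect --format '{{.Architecture}}' "$IMAGE")
#   is_amd64_image_arch "$arch" || echo "rebuild with --platform linux/amd64"
```

If the reported value is `arm64`, rebuilding with `--platform linux/amd64` as above is the thing to try.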
Upvotes: 0
Reputation: 490
In my case, the problem was that I was using CMD instead of ENTRYPOINT in the Dockerfile.
Try using ENTRYPOINT, as in this document: Train an ML model with custom containers
#CMD ["python", "trainer/mnist.py"]
# failed -> the replica master 0 exited with a non-zero status of 127
# Try ENTRYPOINT!
ENTRYPOINT ["python", "trainer/mnist.py"]
This may not be the cause in your case, though. It may be a good idea to check whether the Dockerfile is the cause 🙂 Comparing the sample Dockerfile in the above link with your own may help.
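As background for that check: by POSIX shell convention, 127 is the exit status reported when a command cannot be found, which would fit a container that fails to locate the program it was told to run. I can't say for certain that this is how the platform produces the number. The sketch below only reproduces the convention locally; `exit_status_of` and the missing command name are made up for illustration, and it assumes a POSIX `sh` is on your PATH.

```shell
# Run a command line through sh and print the exit status it produces.
exit_status_of() {
  status=0
  sh -c "$1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

exit_status_of 'true'                                 # prints 0
exit_status_of 'hypothetical-missing-trainer-binary'  # prints 127
```

If running your image's start command by hand (for example with `docker run`) also ends with 127, the command or its interpreter is probably not being found inside the container.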
Upvotes: 5