foobar

Reputation: 11374

Google AI Platform: The replica master 0 exited with a non-zero status of 127

There's a similar SO question: Tensorflow on ML Engine: The replica master 0 exited with a non-zero status of 1

But here I'm getting error 127 instead. As in that question, I launched a PyTorch custom training container on AI Platform (previously ML Engine), and after about 2 minutes I get the error message "The replica master 0 exited with a non-zero status of 127".

The documentation here doesn't quite say what "127" means: https://cloud.google.com/ai-platform/training/docs/troubleshooting#understanding_training_application_return_codes

Anyone have an idea?

Upvotes: 3

Views: 2906

Answers (2)

bele semu

Reputation: 1

Make sure the architecture of your Docker image is compatible with the Vertex AI worker nodes, which use the amd64 architecture. An image built on an Apple Silicon Mac (for example an M2) targets arm64 by default and is incompatible. Build for linux/amd64 explicitly (the trailing "." is the build context):

docker buildx create --use

docker buildx build --platform linux/amd64 -t "name of your image on google cloud" .
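
To decide whether the --platform flag matters on your machine, you can check the host CPU before building. This is only a minimal sketch: needs_cross_build is a hypothetical helper, and the set of names it treats as amd64 is an assumption about what platform.machine() typically reports.

```python
import platform

# Names that platform.machine() commonly reports for 64-bit x86 hosts
# (assumed list, extend it if your system reports something else).
AMD64_NAMES = {"x86_64", "amd64"}


def needs_cross_build(machine: str) -> bool:
    """Return True when the host CPU is not amd64.

    In that case the image should be built with --platform linux/amd64.
    """
    return machine.strip().lower() not in AMD64_NAMES


host = platform.machine()
if needs_cross_build(host):
    print(f"Host is {host}: build with --platform linux/amd64")
else:
    print(f"Host is {host}: a default build already targets amd64")
```

On an Apple Silicon Mac, platform.machine() normally reports arm64, so the check would recommend the cross-build.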

Upvotes: 0

masaya

Reputation: 490

In my case, the problem was that I was using CMD instead of ENTRYPOINT in the Dockerfile.

Use ENTRYPOINT instead, as in this document: Train an ML model with custom containers

#CMD ["python", "trainer/mnist.py"]
# failed -> the replica master 0 exited with a non-zero status of 127

# Try ENTRYPOINT!
ENTRYPOINT ["python", "trainer/mnist.py"]

This may not be the cause in your case, but it is worth checking whether your Dockerfile is the problem 🙂 Comparing the sample Dockerfile in the document linked above with your own may help.
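
As general background (not specific to AI Platform): 127 is the exit status a POSIX shell reports when a command cannot be found. With Docker, arguments passed at run time replace CMD but are appended to ENTRYPOINT, so one possible explanation, which is an assumption on my part, is that the job's arguments replaced the CMD and were then run as a command that doesn't exist. The sketch below only reproduces the 127 status locally; exit_status and the command name are made up for illustration.

```python
import subprocess


def exit_status(command: str) -> int:
    """Run a command line through sh and return its exit status."""
    result = subprocess.run(["sh", "-c", command], capture_output=True)
    return result.returncode


# A command name that should not exist on any system (hypothetical name).
missing = exit_status("definitely-not-a-real-command-xyz")
# "true" is a standard utility that always succeeds.
ok = exit_status("true")

print("missing command ->", missing)
print("true ->", ok)
```

If your container logs show a "not found" message next to the 127, that points at the command being launched and not at your training code.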

Upvotes: 5
