Reputation: 35961
Sometimes a node backing a GKE cluster goes down with NotReady status:
$ kubectl get nodes
NAME                        STATUS     AGE   VERSION
gke-my-pool-f8045547-60gw   Ready      10d   v1.6.2
gke-my-pool-f8045547-7c7e   NotReady   10d   v1.6.2
A node can stay stuck in NotReady for days until I manually restart it.
I have health checks for my pods, so all of them get rescheduled to other nodes, but the problem is that the stale node still has GCE persistent disks attached. As a result, some pods are unable to start on any other node until I manually detach the disks (or restart the stale node).
This basically defeats the whole idea of Kubernetes, because it happens a few times a day, so I have to babysit it all day. Is there any way to configure Kubernetes or GCE to automate this? The simplest approach would be an automatic restart of NotReady nodes, but there seems to be no way to configure a health check for the nodes themselves. Another option would be automatically detaching a disk when it is requested from another machine, but I don't see any way to configure that either. My manual workaround is sketched below.
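For reference, the manual detach looks roughly like this (a sketch: the disk name my-pd-disk and the zone are hypothetical placeholders, substitute your own):

  # Detach the stale persistent disk from the NotReady node so pods on
  # other nodes can attach it (disk name and zone are placeholders).
  $ gcloud compute instances detach-disk gke-my-pool-f8045547-7c7e \
      --disk my-pd-disk --zone us-central1-a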
Upvotes: 0
Views: 4716
Reputation: 251
GKE has a node auto-repair feature that monitors node health and triggers an automatic repair event (currently a node recreation for NotReady nodes). It's currently in Beta, but you can try it: https://cloud.google.com/container-engine/docs/node-auto-repair
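Enabling it on an existing node pool should look something like the following (a sketch, assuming the pool is named my-pool in a cluster named my-cluster; the zone is a placeholder):

  # Turn on auto-repair for the existing node pool (Beta feature at the
  # time of writing; cluster, pool, and zone names are placeholders).
  $ gcloud beta container node-pools update my-pool \
      --cluster my-cluster --zone us-central1-a --enable-autorepair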
Upvotes: 2