Igor Artamonov

Reputation: 35961

GKE, automatic restart of stuck node

Sometimes a node backing my GKE cluster goes down and reports NotReady status:

$ kubectl get nodes
NAME                        STATUS     AGE       VERSION
gke-my-pool-f8045547-60gw   Ready      10d       v1.6.2
gke-my-pool-f8045547-7c7e   NotReady   10d       v1.6.2

A node can stay stuck in NotReady for days, until I restart it manually.

I have a health check for my pods, so all of them get rescheduled to other nodes. The problem is that the stale node still has GCE disks attached, so some pods cannot start on any other node until I manually detach the disks (or restart the stale node).

This basically defeats the whole idea of Kubernetes, because it happens a few times a day, so I have to babysit the cluster all day. Is there any way to configure Kubernetes or GCE to automate this? The simplest way would be an automatic restart of NotReady nodes, but there seems to be no way to configure a health check for the nodes themselves. Another option would be to automatically unmount the disks when they are requested from another machine, but I don't see any way to configure that either.
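For illustration, the manual step described above (finding NotReady nodes and restarting them) could be scripted roughly as follows. This is only a sketch, not a built-in Kubernetes or GKE mechanism: the helper name `not_ready_nodes` and the `ZONE` variable are assumptions introduced here, and it relies on GKE node names matching their GCE instance names.

```shell
# Print the names of nodes whose STATUS column starts with NotReady.
# Reads the tabular output of `kubectl get nodes` on stdin.
# (not_ready_nodes is a hypothetical helper name.)
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ { print $1 }'
}

# Hypothetical usage, not executed here; ZONE must be set to the
# zone of the node pool:
#   kubectl get nodes | not_ready_nodes | while read -r node; do
#     gcloud compute instances reset "$node" --zone "$ZONE"
#   done
```

Run from cron or a loop, this would only approximate a node health check; it does not distinguish a briefly unreachable node from a truly stuck one.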

Upvotes: 0

Views: 4716

Answers (1)

Fabio Yeon

Reputation: 251

GKE has a node auto-repair feature that monitors each node's health status and triggers an automatic repair event (currently a node recreation for NotReady nodes). It's currently in Beta, but you can try it: https://cloud.google.com/container-engine/docs/node-auto-repair
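As a rough sketch, auto-repair can be switched on for an existing node pool with `gcloud`. The wrapper function `enable_autorepair` and its three arguments (pool, cluster, zone) are placeholders introduced here, not names from the question; while the feature is in Beta the command may need to be invoked as `gcloud beta container ...` instead.

```shell
# Enable node auto-repair on an existing node pool.
# Arguments (all placeholders): pool name, cluster name, zone.
enable_autorepair() {
  gcloud container node-pools update "$1" \
    --cluster "$2" \
    --zone "$3" \
    --enable-autorepair
}

# Example call (names are hypothetical, not executed here):
#   enable_autorepair my-pool my-cluster us-central1-a
```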

Upvotes: 2
