Cluster is restarting every day unexpectedly

Question

We recently created a cluster on Kubernetes Engine (GCP) and started to notice strange behavior on it. Every day, at a certain time of day, the nodes are stopped and recreated automatically, making applications unavailable for a few minutes.

How the incidents are displayed in the Stackdriver dashboard:

In order to understand the root cause of the problem, I analyzed the logs in Stackdriver, taking as a reference the incident that happened today (2017-12-19 12:22pm).

Cluster log:

The closest entry related to the incident is at 12:26pm (probably the moment the cluster was coming back).

Node log:

The instance log doesn't seem to help much either. The records closest to the incident only appear at 12:23pm (also after the instance had started to come back).

Has anyone been through this situation before, or does anyone have an idea how we can debug it better and discover what is causing this behavior?

The cause of the incident apparently is not shown in the Stackdriver logs.

ihor_dvoretskyi · Accepted Answer

The described behavior is very similar to how the preemptible nodes in GKE behave (they live a maximum of 24 hours).
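One way to confirm that the restarts are preemptions rather than crashes is to look at the Compute Engine operations, since a preemption should be recorded there as a system event even when the instance logs show nothing useful. The sketch below wraps that query in a small function; the function name `preempted_operations` is made up for illustration, and it assumes `gcloud` is installed and configured for the project that hosts the cluster.

```shell
# Sketch: list Compute Engine operations that record a VM preemption.
# Assumption: gcloud is installed and points at the project hosting the cluster.
# The function name is hypothetical, chosen only for this example.
preempted_operations() {
  gcloud compute operations list \
    --filter="operationType=compute.instances.preempted"
}
```

Calling `preempted_operations` and comparing the timestamps with the incident time (12:22pm in the question) should show whether the nodes were preempted at that moment.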

If you're unsure whether your nodes are preemptible, check the GCP web UI (look for the "Preemptible nodes" line), or use the CLI:

$ gcloud compute instances list | grep gke | awk '{print $4}'

If the command prints "true", the nodes are preemptible (see below):

$ gcloud compute instances list | grep gke | awk '{print $4}'
true
true
true

Note: if you have multiple GKE clusters in the same project, narrow the grep to your GKE cluster name.
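The awk approach relies on column position, and the fourth column holds something else (the internal IP) when a node is not preemptible. A slightly more explicit sketch is to ask gcloud for the `scheduling.preemptible` field directly and filter by cluster name. The function names and the cluster name passed to them are placeholders for illustration, and the `gke-<cluster>-` name prefix is an assumption about how GKE names its node instances.

```shell
# Sketch: print "name,preemptible" for the nodes of one GKE cluster.
# $1 is the cluster name (a placeholder; substitute your own).
# Assumption: node instances are named gke-<cluster>-<pool>-...
list_gke_nodes() {
  gcloud compute instances list \
    --filter="name~^gke-$1-" \
    --format="csv[no-heading](name,scheduling.preemptible)"
}

# Reads "name,flag" lines on stdin and prints only the preemptible node names.
# The flag is compared case-insensitively, so "True" and "true" both match.
only_preemptible() {
  awk -F',' 'tolower($2) == "true" { print $1 }'
}
```

For example, `list_gke_nodes my-cluster | only_preemptible` prints one node name per line, and prints nothing if no node in that cluster is preemptible.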
