Gabriel Gomes
Gabriel Gomes

Reputation: 348

Cluster is restarting everyday unexpectedly

Recently we have created a cluster on Kubernetes Engine (GCP) and we started to notice a strange behavior on it. Every day the nodes are getting stopped and recreated automatically in a certain time of day, making applications unavailable for a few minutes.

How the incidents are displayed in Stackdriver dashboard:

enter image description here

In order to understand the root cause of the problem, I analyzed the logs in Stackdriver, taking as a reference the incident that happened today (2017-12-19 12:22pm).

Cluster log:

The closest entry that exists related to the incident is just at 12:26pm (probably the moment that the cluster was coming back).

enter image description here

Node log:

The instance log also doesn't seem to help too much. The records closest to the incident just appears at 12:23pm (also after the instance start to come back).

enter image description here

Has anyone ever been through this situation before or have any idea how can we debug it better and discover what is causing this behavior?

The cause of the incident apparently is not been shown in Stackdriver logs.

Upvotes: 1

Views: 1174

Answers (1)

ihor_dvoretskyi
ihor_dvoretskyi

Reputation: 582

The described behavior is very similar to how the preemptible nodes in GKE behave (they live a maximum of 24 hours).

If you're unsure if your nodes are preemptible, check the GCP WebUI (my sampleenter image description here below, check the "Preemptible nodes" line), or via CLI:

$ gcloud compute instances list | grep gke | awk '{print $4}'

If the CLI command will return "true", that means that the nodes are preemptible (see below):

$ gcloud compute instances list | grep gke | awk '{print $4}'
true
true
true

Note: if you have multiple GKE clusters under the same project, after grep command add your GKE cluster name.

Upvotes: 2

Related Questions