Daemon

Reputation: 89

Kuberhealthy deployment health check fails frequently saying [Prometheus]: [FIRING:2] kuberhealthy (ClusterUnhealthy kuberhealthy http kuberhealthy observability/kube-prometheus-stack-prometh

Steps to reproduce:

Kuberhealthy runs a deployment check on a regular schedule. The deployment itself appears to complete, but the check fails to report its status to the kuberhealthy service:

$ k get events -nkuberhealthy | grep deployment | tail
12m         Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-XXX to 2
12m         Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 4
12m         Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-XXX to 0
3m31s       Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 4
3m9s        Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 2
3m9s        Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-69459d778b to 2
3m9s        Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 4
3m          Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-XXX to 0
63m         Warning   FailedToUpdateEndpoint   endpoints/deployment-svc                              Failed to update endpoint kuberhealthy/deployment-svc: Operation cannot be fulfilled on endpoints "deployment-svc": the object has been modified; please apply your changes to the latest version and try again
53m         Warning   FailedToUpdateEndpoint

Debug logs:

$ k logs deployment-XXX -nkuberhealthy
time="2022-12-16T12:36:43Z" level=info msg="Found instance namespace: kuberhealthy"
time="2022-12-16T12:36:43Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
time="2022-12-16T12:36:43Z" level=info msg="Debug logging enabled."
time="2022-12-16T12:36:43Z" level=debug msg="[/app/deployment-check]"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_IMAGE: XXXX"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_IMAGE_ROLL_TO: XXX"
time="2022-12-16T12:36:43Z" level=info msg="Found pod namespace: kuberhealthy"
time="2022-12-16T12:36:43Z" level=info msg="Performing check in kuberhealthy namespace."
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_DEPLOYMENT_REPLICAS: 2"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_SERVICE_ACCOUNT: default"
time="2022-12-16T12:36:43Z" level=info msg="Check time limit set to: 14m46.760673918s"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_DEPLOYMENT_ROLLING_UPDATE: true"
time="2022-12-16T12:36:43Z" level=info msg="Check deployment image will be rolled from [XXX] to [XXXX]"
time="2022-12-16T12:36:43Z" level=debug msg="Allowing this check 14m46.760673918s to finish."
time="2022-12-16T12:36:43Z" level=info msg="Kubernetes client created."
time="2022-12-16T12:36:43Z" level=info msg="Waiting for node to become ready before starting check."
time="2022-12-16T12:36:43Z" level=debug msg="Checking if the kuberhealthy endpoint: XXX is ready."
time="2022-12-16T12:36:43Z" level=debug msg="XXX."
time="2022-12-16T12:36:43Z" level=debug msg="Kuberhealthy endpoint: XXX is ready. Proceeding to run check."
time="2022-12-16T12:36:43Z" level=info msg="Starting check."
time="2022-12-16T12:36:43Z" level=info msg="Wiping all found orphaned resources belonging to this check."
time="2022-12-16T12:36:43Z" level=info msg="Attempting to find previously created service(s) belonging to this check."
time="2022-12-16T12:36:43Z" level=debug msg="Found 1 service(s)."
time="2022-12-16T12:36:43Z" level=debug msg="Service: kuberhealthy"
time="2022-12-16T12:36:43Z" level=info msg="Did not find any old service(s) belonging to this check."
time="2022-12-16T12:36:43Z" level=info msg="Attempting to find previously created deployment(s) belonging to this check."
time="2022-12-16T12:36:44Z" level=debug msg="Found 1 deployment(s)"
time="2022-12-16T12:36:44Z" level=debug msg=kuberhealthy
time="2022-12-16T12:36:44Z" level=info msg="Did not find any old deployment(s) belonging to this check."
time="2022-12-16T12:36:44Z" level=info msg="Successfully cleaned up prior check resources."
time="2022-12-16T12:36:44Z" level=info msg="Creating deployment resource with 2 replica(s) in kuberhealthy namespace using image XXX]"
time="2022-12-16T12:36:44Z" level=info msg="Creating container using image [XXX]"
time="2022-12-16T12:36:44Z" level=info msg="Created deployment resource."
time="2022-12-16T12:36:44Z" level=info msg="Creating deployment in cluster with name: deployment-deployment"
time="2022-12-16T12:36:44Z" level=info msg="Watching for deployment to exist."
time="2022-12-16T12:36:44Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event ADDED"
time="2022-12-16T12:36:47Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event MODIFIED"
time="2022-12-16T12:36:48Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event MODIFIED"
time="2022-12-16T12:36:53Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event MODIFIED"
time="2022-12-16T12:36:53Z" level=info msg="Deployment is reporting Available with True."
time="2022-12-16T12:36:53Z" level=info msg="Created deployment in kuberhealthy namespace: deployment-deployment"
time="2022-12-16T12:36:53Z" level=info msg="Creating service resource for kuberhealthy namespace."
time="2022-12-16T12:36:53Z" level=info msg="Created service resource."
time="2022-12-16T12:36:53Z" level=info msg="Creating service in cluster with name: deployment-svc"
time="2022-12-16T12:36:53Z" level=info msg="Watching for service to exist."
time="2022-12-16T12:36:53Z" level=debug msg="Received an event watching for service changes: ADDED"
time="2022-12-16T12:36:53Z" level=info msg="Cluster IP found:XXX"
time="2022-12-16T12:36:53Z" level=info msg="Created service in kuberhealthy namespace: deployment-svc"
time="2022-12-16T12:36:53Z" level=debug msg="Retrieving a cluster IP belonging to: deployment-svc"
time="2022-12-16T12:36:53Z" level=info msg="Found service cluster IP address: XXX"
time="2022-12-16T12:36:53Z" level=info msg="Looking for a response from the endpoint."
time="2022-12-16T12:36:53Z" level=debug msg="Setting timeout for backoff loop to: 3m0s"
time="2022-12-16T12:36:53Z" level=info msg="Beginning backoff loop for HTTP GET request."
time="2022-12-16T12:36:53Z" level=debug msg="Making GET to XXX"
time="2022-12-16T12:36:53Z" level=debug msg="Got a 401"
time="2022-12-16T12:36:53Z" level=info msg="Retrying in 5 seconds."
time="2022-12-16T12:36:58Z" level=error msg="error occurred making request to service in cluster: could not get a response from the given address: XXX"
time="2022-12-16T12:36:58Z" level=info msg="Cleaning up deployment and service."
time="2022-12-16T12:36:58Z" level=info msg="Attempting to delete service deployment-svc in kuberhealthy namespace."
time="2022-12-16T12:36:58Z" level=debug msg="Checking if service has been deleted."
time="2022-12-16T12:36:58Z" level=debug msg="Delete service and wait has not yet timed out."
time="2022-12-16T12:36:58Z" level=debug msg="Waiting 5 seconds before trying again."
time="2022-12-16T12:37:03Z" level=info msg="Attempting to delete deployment in kuberhealthy namespace."
time="2022-12-16T12:37:03Z" level=debug msg="Checking if deployment has been deleted."
time="2022-12-16T12:37:03Z" level=debug msg="Delete deployment and wait has not yet timed out."
time="2022-12-16T12:37:03Z" level=debug msg="Waiting 5 seconds before trying again."
time="2022-12-16T12:37:08Z" level=info msg="Finished clean up process."
time="2022-12-16T12:37:08Z" level=error msg="Reporting errors to Kuberhealthy: [could not get a response from the given address: XXX"

Upvotes: 0

Views: 444

Answers (1)

Veera Nagireddy

Reputation: 1890

It looks like your Kubernetes deployment is working fine. This warning is common behavior: Kubernetes tells its clients (controllers) to retry the operation. It is perfectly fine and you can safely ignore it.

Let me try to explain the generic cause of such a warning in the events:

The K8s API Server is implementing something called "Optimistic concurrency control" (sometimes referred to as optimistic locking). This is a method where instead of locking a piece of data and preventing it from being read or updated while the lock is in place, the piece of data includes a version number. Every time the data is updated, the version number increases.

When updating the data, the version number is checked to see if it has increased between the time the client reads the data and the time it submits the update. If this happens, the update is rejected and the client must re-read the new data and try to update it again. The result is that when two clients try to update the same data entry, only the first one succeeds.
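In client-go this re-read-and-retry pattern is packaged as a helper. Below is a minimal sketch, assuming an in-cluster Go client; the service name deployment-svc and the kuberhealthy namespace are taken from the events above, and the label being set is purely illustrative:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Assumes the code runs in-cluster; outside a cluster you would build
	// the config from a kubeconfig file instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// RetryOnConflict re-runs the closure whenever the API server rejects
	// the update because the object's resourceVersion is stale -- exactly
	// the "please apply your changes to the latest version and try again"
	// case from the events above.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the object fresh on every attempt so the update carries
		// the latest resourceVersion.
		svc, getErr := clientset.CoreV1().Services("kuberhealthy").Get(
			context.TODO(), "deployment-svc", metav1.GetOptions{})
		if getErr != nil {
			return getErr
		}
		if svc.Labels == nil {
			svc.Labels = map[string]string{}
		}
		svc.Labels["example/touched"] = "true" // illustrative change only
		_, updateErr := clientset.CoreV1().Services("kuberhealthy").Update(
			context.TODO(), svc, metav1.UpdateOptions{})
		return updateErr // a Conflict error here triggers another attempt
	})
	if err != nil {
		fmt.Println("update failed after retries:", err)
	}
}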

You can also refer to this related Stack Overflow answer for relevant information.

Also, please go through Kubernetes Health Checks: Everything You Need to Know for more info.

EDIT: If you're running a version which doesn't contain the latest fixes/patches, I recommend you upgrade your master and nodes to a newer minor version, for example 1.12.10-gke.20 or 22. This will help isolate whether the issue is the GKE version or some other underlying problem; please go through the GKE release notes for more information. Since the nodes are unhealthy and deployments are experiencing timeouts, upgrading might resolve the issue.

It seems the above warning message appears when you run the current file from Kubernetes Engine > Workloads > YAML. If so, then to solve the problem you need to find the exact YAML file, edit it as required, and then apply it with a command like kubectl apply -f nginx-1.yaml. If this does not work, please check the details of the executed operation (including the deployment of the pod).

Please check the Debug Pods instructions, which describe some common troubleshooting steps to help users debug applications that are deployed into Kubernetes and are not behaving correctly.

You may also visit the Monitoring, Logging, and Debugging troubleshooting document for more information.

Also go through another similar Stack Overflow question, which may help to resolve your issue.

Upvotes: 2
