Reputation: 595
On GKE, I tried using a readiness probe / liveness probe and posting alerts with Cloud Monitoring: https://cloud.google.com/monitoring/alerts/using-alerting-ui
As a test, I created a pod that has a readiness probe and a liveness probe. The probe checks failed every time, as I expected.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 0
      periodSeconds: 10
      timeoutSeconds: 10
      successThreshold: 1
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 20
      periodSeconds: 60
      timeoutSeconds: 30
      successThreshold: 1
      failureThreshold: 3
Checking the GCP logs, both error logs initially showed up at the configured periodSeconds intervals.
Readiness probe: every 10 seconds
2021-02-21 13:26:30.000 JST Readiness probe failed: HTTP probe failed with statuscode: 500
2021-02-21 13:26:40.000 JST Readiness probe failed: HTTP probe failed with statuscode: 500
Liveness probe: every 1 minute
2021-02-21 13:25:40.000 JST Liveness probe failed: HTTP probe failed with statuscode: 500
2021-02-21 13:26:40.000 JST Liveness probe failed: HTTP probe failed with statuscode: 500
But after running this pod for several minutes:
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
30m Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
25m Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
20m Warning BackOff pod/liveness-http Back-off restarting failed container
20m Normal Scheduled pod/liveness-http Successfully assigned default/liveness-http to gke-cluster-default-pool-8bc9c75c-rfgc
17m Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
17m Normal Pulled pod/liveness-http Successfully pulled image "k8s.gcr.io/liveness"
17m Normal Created pod/liveness-http Created container liveness
20m Normal Started pod/liveness-http Started container liveness
4m59s Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
17m Warning Unhealthy pod/liveness-http Liveness probe failed: HTTP probe failed with statuscode: 500
17m Normal Killing pod/liveness-http Container liveness failed liveness probe, will be restarted
In my plan, I would create an alert policy whose condition is like:
but if the probe checks aren't called as I expect, these policies don't work; even though the pod is not running, the alert becomes resolved.
Why didn't the liveness probe run, and why did the interval of the readiness probe change?
Note: if there is another good alert policy for checking the liveness of a pod, I wouldn't care about this behavior. I would appreciate any advice on what kind of alert policy is ideal for checking pods.
Upvotes: 2
Views: 18069
Reputation: 14102
In the Configure Liveness, Readiness and Startup Probes documentation you can find the following information:
The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
As the GKE master is managed by Google, you won't find kubelet logs using the CLI (you might try to use Stackdriver). I've tested this on a kubeadm cluster with the kubelet verbosity level set to 8.
When you use $ kubectl get events you only get events from the last hour (this can be changed in the Kubernetes settings on a kubeadm cluster, but I don't think it can be changed in GKE, as the master is managed by Google).
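For reference, on a self-managed control plane the event retention is controlled by the kube-apiserver --event-ttl flag (default 1h0m0s). A minimal sketch of the relevant fragment, assuming the standard kubeadm static-pod manifest path /etc/kubernetes/manifests/kube-apiserver.yaml:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm default path)
spec:
  containers:
  - command:
    - kube-apiserver
    # keep events for 3 hours instead of the default 1 hour
    - --event-ttl=3h
    # ...other flags unchanged...
```

On GKE this flag is not accessible, since Google manages the control plane.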
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
37m Normal Starting node/kubeadm Starting kubelet.
...
33m Normal Scheduled pod/liveness-http Successfully assigned default/liveness-http to kubeadm
33m Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
33m Normal Pulled pod/liveness-http Successfully pulled image "k8s.gcr.io/liveness" in 893.953679ms
33m Normal Created pod/liveness-http Created container liveness
33m Normal Started pod/liveness-http Started container liveness
3m12s Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
30m Warning Unhealthy pod/liveness-http Liveness probe failed: HTTP probe failed with statuscode: 500
8m17s Warning BackOff pod/liveness-http Back-off restarting failed container
The same command again after ~1 hour:
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
33s Normal Pulling pod/liveness-http Pulling image "k8s.gcr.io/liveness"
5m40s Warning Unhealthy pod/liveness-http Readiness probe failed: HTTP probe failed with statuscode: 500
15m Warning BackOff pod/liveness-http Back-off restarting failed container
The Readiness probe check was executed every 10 seconds for more than one hour.
Mar 09 14:48:34 kubeadm kubelet[3855]: I0309 14:48:34.222085 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 14:48:44 kubeadm kubelet[3855]: I0309 14:48:44.221782 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 14:48:54 kubeadm kubelet[3855]: I0309 14:48:54.221828 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 15:01:34 kubeadm kubelet[3855]: I0309 15:01:34.222491 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:01:44 kubeadm kubelet[3855]: I0309 15:01:44.221877 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:01:54 kubeadm kubelet[3855]: I0309 15:01:54.221976 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 15:10:14 kubeadm kubelet[3855]: I0309 15:10:14.222163 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:10:24 kubeadm kubelet[3855]: I0309 15:10:24.221744 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 15:10:34 kubeadm kubelet[3855]: I0309 15:10:34.223877 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
...
Mar 09 16:04:14 kubeadm kubelet[3855]: I0309 16:04:14.222853 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:04:24 kubeadm kubelet[3855]: I0309 16:04:24.222531 3855 prober.go:117] Readiness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
There are also Liveness probe entries:
Mar 09 16:12:58 kubeadm kubelet[3855]: I0309 16:12:58.462878 3855 prober.go:117] Liveness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:13:58 kubeadm kubelet[3855]: I0309 16:13:58.462906 3855 prober.go:117] Liveness probe for "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a):liveness" failed (failure): HTTP probe failed with statuscode: 500
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465470 3855 kuberuntime_manager.go:656] Container "liveness" ({"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) of pod liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a): Container liveness failed liveness probe, will be restarted
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465587 3855 kuberuntime_manager.go:712] Killing unwanted container "liveness"(id={"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) for pod "liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a)"
Total time of tests:
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
liveness-http 0/1 Running 21 99m
liveness-http 0/1 CrashLoopBackOff 21 101m
liveness-http 0/1 Running 22 106m
liveness-http 1/1 Running 22 106m
liveness-http 0/1 Running 22 106m
liveness-http 0/1 Running 23 109m
liveness-http 1/1 Running 23 109m
liveness-http 0/1 Running 23 109m
liveness-http 0/1 CrashLoopBackOff 23 112m
liveness-http 0/1 Running 24 117m
liveness-http 1/1 Running 24 117m
liveness-http 0/1 Running 24 117m
Liveness probe check wasn't called anymore
The liveness check is created when Kubernetes creates the pod and is recreated each time the Pod is restarted. In your configuration you have set initialDelaySeconds: 20, so after the pod is created Kubernetes will wait 20 seconds and then call the liveness probe 3 times (as you have set failureThreshold: 3). After 3 failures, Kubernetes will restart the pod according to its RestartPolicy. You will also be able to find this in the logs:
Mar 09 16:14:58 kubeadm kubelet[3855]: I0309 16:14:58.465470 3855 kuberuntime_manager.go:656] Container "liveness" ({"docker" "95567f85708ffac8b34b6c6f2bdb49d8eb57e7704b7b416083c7f296dd40cd0b"}) of pod liveness-http_default(8c87a08e-34aa-4bb1-be9b-fdca39a4562a): Container liveness failed liveness probe, will be restarted
Why will it be restarted? The answer can be found in Container probes:
livenessProbe:
Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy.
The default restart policy in GKE is Always, so your pod will be restarted over and over again.
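If you did not want the container restarted after a failed liveness probe, you could set the policy explicitly; a minimal sketch (note that restartPolicy is a pod-level field, and that pods managed by a Deployment only allow Always):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  restartPolicy: Never   # default is Always; OnFailure is the third option
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
```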
Readiness probe check was called, but the interval became long (maximum interval looks to be about 10 minutes)
I think you've come to that conclusion based on $ kubectl get events and $ kubectl describe po. In both cases, events are removed after 1 hour by default. In my Tests part you can see that the Readiness probe entries run from 14:48:34 till 16:04:24, so Kubernetes calls the Readiness probe every 10 seconds.
Why didn't the Liveness probe run, and why did the interval of the Readiness probe change?
As I showed in the Tests part, the Readiness probe interval didn't change. What was misleading in this case was relying on $ kubectl get events. As for the Liveness probe, it is still being called, but only 3 times each time the pod is created/restarted. I've also included the output of $ kubectl get po -w; when the pod is recreated, you can find those liveness probe calls in the kubelet logs.
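The restart cycle described above follows directly from the probe parameters. A rough back-of-the-envelope sketch (ignoring the crash-loop back-off, which grows to as much as 5 minutes between restarts, and assuming each probe fails immediately):

```python
# Approximate time from container start until the kubelet kills it,
# based on the livenessProbe settings from the question.
initial_delay = 20    # initialDelaySeconds
period = 60           # periodSeconds
failure_threshold = 3 # failureThreshold

# The first probe fires after the initial delay; each further probe one period later.
probe_times = [initial_delay + i * period for i in range(failure_threshold)]
time_to_restart = probe_times[-1]

print(probe_times)      # [20, 80, 140]
print(time_to_restart)  # container is killed ~140s after it started
```

This matches the observed behavior: roughly three liveness failures per container lifetime, then a restart, while the readiness probe keeps firing every 10 seconds throughout.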
In my plan, I would create an alert policy, whose condition is like:
- if liveness probe error happens 3 times in 3 minutes
If the liveness probe fails 3 times, with your current setup it will restart this pod. In that situation you could use each restart to create an alert.
Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
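As a sketch, a Cloud Monitoring MQL condition over that metric might look like the following; the pod-name filter, alignment window, and threshold are assumptions you would tune for your workload (see the alerting UI documentation linked in the question):

```
fetch k8s_container
| metric 'kubernetes.io/container/restart_count'
| filter (resource.pod_name == 'liveness-http')
| align delta(3m)
| every 3m
| condition val() > 2
```

Since restart_count is a cumulative counter, delta alignment turns it into "restarts within the window", which maps onto the "3 failures in 3 minutes" condition you described.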
You can find some more useful information in other Stack Overflow questions regarding Monitoring alerts.
Upvotes: 2