Thanhma San

Reputation: 1497

Should I run liveness and readiness probes every second?

In my K8s workloads, I use readiness and liveness probes for pod health checks.

I'm wondering whether I should set the interval (periodSeconds) as low as 1 second. That would consume more resources, right?

Are there best practices for pod health checks?

Upvotes: 5

Views: 2435

Answers (2)

Rotem jackoby

Reputation: 22128

I would split the discussion between the two types of probes.

Regarding readiness probes, I think @csteph's answer provides what you need.

Regarding liveness probes, I would say:

(1) Most likely you don't need to run them every second.

(2) There is also a good chance you don't really need them at all.

(3) Even if you think you need them, be careful and read below.

Liveness probes should be used when the application can reach a state from which it cannot recover on its own, and you should make sure the probe's logic really identifies that state and nothing else.

From the docs:

Caution:
Liveness probes can be a powerful way to recover from application failures, but they should be used with caution. Liveness probes must be configured carefully to ensure that they truly indicate unrecoverable application failure, for example a deadlock.

Note:
Incorrect implementation of liveness probes can lead to cascading failures. This results in restarting of container under high load; failed client requests as your application became less scalable; and increased workload on remaining pods due to some failed pods. Understand the difference between readiness and liveness probes and when to apply them for your app.

Upvotes: 0

csteph

Reputation: 116

Firstly, it is important to understand the difference between Liveness and Readiness. The tl;dr is: Liveness is about whether K8s should kill and restart the container; Readiness is about whether the container is able to accept requests. It is likely that you want different parameters for each.

Whether K8s takes any action based on the outcome of the probe depends on the failureThreshold. This is the number of times in a row the probe has to fail before K8s does something. If you combine this with periodSeconds you can tune the sensitivity of your probes.
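As a rough illustration of how these two settings combine, here is a minimal Python sketch. The function name is made up for this example, and the model is a simplification: it ignores timeoutSeconds and initialDelaySeconds and assumes the probe fails on every attempt once the service breaks.

```python
def seconds_until_action(period_seconds, failure_threshold):
    """Rough (best, worst) delay between a service breaking and K8s acting.

    Simplified model (an assumption, not the kubelet's exact behaviour):
    probes run every period_seconds, each probe fails as soon as the
    service is broken, and timeouts and initial delays are ignored.
    """
    # Best case: the service breaks just before a probe fires, so the first
    # failure is seen at once and only threshold - 1 more periods are needed.
    best = (failure_threshold - 1) * period_seconds
    # Worst case: the service breaks just after a successful probe, so a full
    # period passes before the first failure is even observed.
    worst = failure_threshold * period_seconds
    return best, worst


print(seconds_until_action(1, 3))    # (2, 3)
print(seconds_until_action(10, 3))   # (20, 30)
```

With the same failureThreshold, moving periodSeconds from 1 to 10 stretches the reaction time from a few seconds to roughly half a minute, which is the sensitivity you are tuning.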

In general you want to balance:

  • the time it takes K8s to take action, weighed against how quickly your service can be expected to recover once it does
  • the "cost" of the probes. For example, if your Readiness probe connects to a database, then with a periodSeconds of 1 you are adding 1 Query Per Second (QPS) of load to your database per replica (with 100 replicas, you would be generating 100 QPS just through probes!)
  • the reliability of your probe, also known as "flakiness". What is the false negative rate, i.e. what proportion of the time does the probe report failed while the service is actually running within expected performance rates?
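The "cost" point above is simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes each probe issues exactly one query against the shared dependency and that all replicas probe at the same interval.

```python
def probe_qps(replicas, period_seconds):
    """Queries per second that probes alone send to a shared dependency.

    Assumes (for illustration) one query per probe and the same
    periodSeconds on every replica.
    """
    return replicas / period_seconds


print(probe_qps(100, 1))    # 100.0
print(probe_qps(100, 10))   # 10.0
```

Raising periodSeconds from 1 to 10 cuts the probe-generated load by a factor of ten for the same number of replicas.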

Here is one way of thinking about it:

  • Work out how long your service can be in the failed state before K8s should take action. This should be based on how long it would take to recover (e.g. restart in the case of Liveness)
  • If a probe is "expensive", have a longer periodSeconds and smaller failureThreshold
  • If a probe is "flaky" (i.e. occasionally reports failed and then reports working very quickly afterwards), have a shorter periodSeconds and larger failureThreshold.
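One way to apply the steps above is to fix the time window first and derive failureThreshold from the periodSeconds you can afford. The sketch below is only an illustration: the helper name and the 30-second window are assumptions, and it uses the simple approximation that K8s acts after about periodSeconds × failureThreshold seconds.

```python
def failure_threshold_for(window_seconds, period_seconds):
    """Largest failureThreshold that keeps period * threshold within the window.

    Uses the approximation window = periodSeconds * failureThreshold and
    never returns less than 1, since at least one failure is required.
    """
    return max(1, window_seconds // period_seconds)


window = 30  # hypothetical: the service may be failed for ~30s before K8s acts

# "Expensive" probe: probe rarely, tolerate few failures.
print(failure_threshold_for(window, 15))   # 2
# "Flaky" probe: probe often, require many consecutive failures.
print(failure_threshold_for(window, 3))    # 10
```

Both settings react within roughly the same 30 seconds, but the first runs a fifth as many probes, while the second needs ten consecutive failures and so is much less likely to act on a single spurious result.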

Upvotes: 7
