user674669

Reputation: 12352

Fake liveness/readiness probe in kubernetes

Is it possible to fake a container to always be ready/live in kubernetes so that kubernetes thinks that the container is live and doesn't try to kill/recreate the container? I am looking for a quick and hacky solution, preferably.

Upvotes: 11

Views: 8683

Answers (2)

Eduardo Baitello

Reputation: 11346

Liveness and Readiness probes are not required by k8s controllers; you can simply remove them and your containers will always be considered live/ready.
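For completeness, here's a minimal sketch of a container spec with no probes defined at all (the names are placeholders); with nothing configured, the kubelet never probes the container, so it is considered live, and marked ready as soon as it is running:

apiVersion: v1
kind: Pod
metadata:
  name: no-probes-example  # placeholder name
spec:
  containers:
  - name: app
    image: nginx:1.7.9
    ports:
    - containerPort: 80
    # no livenessProbe / readinessProbe / startupProbe defined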

If you want the hacky approach anyway, use an exec probe (instead of httpGet) with a dummy command that always returns exit code 0. For example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        livenessProbe:
          exec:
            command:  # touch always exits 0, so the probe always succeeds
            - touch
            - /tmp/healthy
        readinessProbe:
          exec:
            command:  # same dummy command, so the container is always reported ready
            - touch
            - /tmp/healthy

Please note that this will only report ready/live if you have not set readOnlyRootFilesystem: true, since touch needs to write to the container's filesystem.
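If you do need readOnlyRootFilesystem: true, a sketch of a workaround (assuming the image ships a /bin/true binary, which Debian-based images like nginx do) is to exec something that exits 0 without writing to disk:

        livenessProbe:
          exec:
            command:  # /bin/true exits 0 without writing anything
            - /bin/true
        readinessProbe:
          exec:
            command:
            - /bin/true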

Upvotes: 17

neoakris

Reputation: 5075

I'd like to add some background context about why and how this can be useful for real-world applications. That context also leads to an even better answer.

First off, why might you want to implement a fake startup / readiness / liveness probe?
Let's say you have a custom containerized application and you're in a rush, so you go live without any liveness or readiness probes.

Scenario 1:
You have a deployment with 1 replica, but you notice that whenever you update your app (push a new version via a rolling update), your monitoring platform occasionally reports 400, 500, and timeout errors during the rolling update. After the update you're back at 1 replica and the errors go away.

Scenario 2:
You have enough traffic to warrant autoscaling and multiple replicas. You consistently see 1-3% errors and 97% success.

Why are you getting errors in both scenarios?
Let's say your app takes 1 minute to finish booting up and become ready to receive traffic. If you don't have readiness probes, newly spawned instances of your container receive traffic before they've finished booting up, and those instances are probably what's causing the temporary 400, 500, and timeout errors.

How to fix:
You can fix the occasional errors in Scenarios 1 and 2 by adding a readiness probe with an initialDelaySeconds (or a startup probe), basically something that waits long enough for your containerized app to finish booting up.

Now the correct, best-practice thing to do is to write a /health endpoint that properly reflects the health of your app. But writing an accurate health check endpoint can take time. In many cases you can get the same end result (make the errors go away) without the effort of creating a /health endpoint by faking it: just add a wait period that lets your app finish booting up before traffic is sent to it. (Again, /health is the best practice, but for the "ain't nobody got time for that" crowd, faking it can be a good enough stopgap solution.)
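As a concrete sketch of that stopgap (the port and timings below are placeholder assumptions, not values from any particular app), a readiness probe that just waits out the boot time and then checks that the app's port accepts connections could look like this:

        readinessProbe:
          tcpSocket:
            port: 8080             # assumed app port
          initialDelaySeconds: 60  # roughly how long the app takes to boot
          periodSeconds: 10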

Below is a better version of a fake readiness probe, along with why it's better:

  1. exec based probes don't work in 100% of cases: they assume a shell exists in the container and that the probe's commands exist in the container. There are scenarios where hardened containers don't have things like a shell or a touch command.
  2. httpGet, tcpSocket, and grpc probes are done from the perspective of the node running the kubelet (the kubernetes agent), so they don't depend on the software installed in the container and should work on hardened containers that are missing things like the touch command, or even scratch containers. (In other words this solution should work 100% of the time vs 99% of the time.)
  3. An alternative to a startup probe is to use initialDelaySeconds with a readiness probe, but that creates unnecessary traffic compared to a startup probe that runs once. (Again this isn't the best solution in terms of accuracy/fastest possible startup time, but it's often a good enough, very practical solution.)
  4. Run my example in a cluster and you'll see it's not ready for 60 seconds, then becomes ready (you can watch this with the kubectl command shown after the manifest).
  5. Since this is a fake probe, it's pointless to use readiness/liveness probes; just go with a startup probe, as that cuts down on unnecessary traffic.
  6. In the absence of a readiness probe, the startup probe has the effect of a readiness probe (it blocks the pod from being ready until the probe passes, but only during initial startup).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: useful-hack
  labels:
    app: always-true-tcp-probe
spec:
  replicas: 1
  strategy: 
    type: Recreate #dev env fast feedback loop optimized value, don't use in prod
  selector:
    matchLabels:
      app: always-true-tcp-probe
  template:
    metadata:
      labels:
        app: always-true-tcp-probe
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        startupProbe:
          tcpSocket:
            host: 127.0.0.1  #Since the kubelet does the probes, this is the node's localhost, not the pod's localhost
            port: 10250      #worker node's kubelet listening port
          successThreshold: 1
          failureThreshold: 2
          initialDelaySeconds: 60 #wait 60 sec before starting the probe
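To see point 4 from the list above in action (assuming you saved the manifest as useful-hack.yaml, a placeholder filename), apply it and watch the READY column flip from 0/1 to 1/1 after roughly 60 seconds:

kubectl apply -f useful-hack.yaml
kubectl get pods -l app=always-true-tcp-probe --watch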

Additional Notes:

  1. The above example keeps traffic within the LAN, which has several benefits:
    • It'll work in internet-disconnected environments.
    • It won't incur egress network charges.
  2. The below example will only work in internet-connected environments. It isn't too bad for a startup probe, but it would be a bad idea for a readiness / liveness probe as it could clog the NAT GW bandwidth. I'm only including it to point out something of interest.
        startupProbe:
          httpGet:
            host: google.com  #defaults to the pod IP
            path: /
            port: 80
            scheme: HTTP
          successThreshold: 1
          failureThreshold: 2
          initialDelaySeconds: 60
---
        startupProbe:
          tcpSocket:
            host: 1.1.1.1  #CloudFlare
            port: 53       #DNS
          successThreshold: 1
          failureThreshold: 2
          initialDelaySeconds: 60

The interesting bit:
Remember I said "httpGet, tcpSocket, and grpc probes are done from the perspective of the node running the kubelet (the kubernetes agent)." The kubelet runs on the worker node's host OS, which is configured to use upstream DNS; in other words it doesn't have access to the inner-cluster DNS entries that kube-dns is aware of, so you can't specify Kubernetes Service names in these probes.

Additionally, Kubernetes Service IPs won't work for the probes either, since they're VIPs (Virtual IPs) that only* exist in iptables (*in most cases).
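For example, a probe like this sketch (my-service and my-namespace are placeholders) would fail in most clusters, because the kubelet resolves the hostname on the node, where cluster DNS names don't exist:

        startupProbe:
          tcpSocket:
            host: my-service.my-namespace.svc.cluster.local  # in-cluster name, not resolvable by the node
            port: 80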

Upvotes: 1
