gabhijit

Reputation: 3585

How to determine a failed kubernetes deployment?

I create a Deployment with a replica count of, say, 2, which runs an application (a simple web server) - basically it is an always-running command. However, due to misconfiguration, the command sometimes exits and the Pod is then Terminated.

Due to the default restartPolicy of Always, the Pod (and hence the container) is restarted, and eventually the Pod status becomes CrashLoopBackOff.

If I do a kubectl describe deployment, it shows Condition as Progressing=True and Available=False.

This looks fine - the question is - how do I mark my deployment as 'failed' in the above case?

Adding spec.progressDeadlineSeconds doesn't seem to have any effect.

Will simply setting restartPolicy to Never in the Pod specification be enough?

A related question: is there a way of getting this information as a trigger/webhook, without doing a rollout status watch?

Upvotes: 2

Views: 12697

Answers (2)

Rotem jackoby

Reputation: 22148

A bit of theory

Regarding your question:

How do I mark my deployment as 'failed' in the above case?

Kubernetes gives you two types of health checks:

1 ) Readiness
Readiness probes are designed to let Kubernetes know when your app is ready to serve traffic.
Kubernetes makes sure the readiness probe passes before allowing a service to send traffic to the pod.
If a readiness probe starts to fail, Kubernetes stops sending traffic to the pod until it passes.

2 ) Liveness
Liveness probes let Kubernetes know if your app is alive or dead.
If your app is alive, then Kubernetes leaves it alone. If your app is dead, Kubernetes removes the Pod and starts a new one to replace it.

At the moment (v1.19.0), Kubernetes supports three mechanisms for implementing liveness and readiness probes (each is sketched right after the list):

A ) ExecAction: Executes a specified command inside the container. The diagnostic is considered successful if the command exits with a status code of 0.

B ) TCPSocketAction: Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is considered successful if the port is open.

C ) HTTPGetAction: Performs an HTTP GET request against the Pod's IP address on a specified port and path. The diagnostic is considered successful if the response has a status code greater than or equal to 200 and less than 400.
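
For illustration, here is roughly how each mechanism looks inside a container spec - a minimal sketch only, where the command, port and path are placeholders you would replace with your own:

A ) ExecAction:

   livenessProbe:
      exec:
         command: ["cat", "/tmp/healthy"]   # placeholder health command
      periodSeconds: 10

B ) TCPSocketAction:

   livenessProbe:
      tcpSocket:
         port: 8080                         # placeholder port
      periodSeconds: 10

C ) HTTPGetAction:

   livenessProbe:
      httpGet:
         path: /healthz                     # placeholder path
         port: 8080
      periodSeconds: 10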


In your case:

If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.
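
As a side note on the restartPolicy part of your question: it lives in the Pod spec, and a Deployment's Pod template only allows Always, so Never would only apply to a bare Pod. A minimal sketch (the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: web-server          # placeholder name
spec:
  restartPolicy: Never      # valid for a bare Pod; Deployment templates only allow Always
  containers:
  - name: web
    image: nginx            # placeholder image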

I think that in your case (the need to mark a deployment as succeeded / failed and take the proper action) you should:

Step 1:
Setup a HTTP/TCP readiness Probe - for example:

   readinessProbe:
      httpGet:
         path: /health-check
         port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 2

Where:

initialDelaySeconds — The number of seconds after the container has started before the readiness probe is initiated.

periodSeconds — How often to perform the readiness probe.

failureThreshold — The number of consecutive probe failures after which the Pod is marked as not ready.

Step 2:
Choose the relevant rolling update strategy and decide how to handle failures of new Pods (consider reading this thread for examples).
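
For example, the strategy block of a Deployment might look like this (the values are illustrative and should be tuned to your workload):

   strategy:
      type: RollingUpdate
      rollingUpdate:
         maxSurge: 1          # how many extra Pods may be created above the desired replica count
         maxUnavailable: 0    # how many Pods may be unavailable during the update

With maxUnavailable: 0 and a readiness probe in place, a broken new version stalls the rollout instead of replacing the healthy old Pods.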

A few references you can follow:

Container probes
Kubernetes Liveness and Readiness Probes
Kubernetes : Configure Liveness and Readiness Probes
Kubernetes and Containers Best Practices - Health Probes
Creating Liveness Probes for your Node.js application in Kubernetes


A Failed Deployment

A Deployment (or its rollout process) is considered Failed if it keeps trying to deploy its newest ReplicaSet without ever completing, until the progressDeadlineSeconds interval is exceeded.

Kubernetes then updates the Deployment status with:

Conditions:
  Type            Status  Reason
  ----            ------  ------
  Available       True    MinimumReplicasAvailable
  Progressing     False   ProgressDeadlineExceeded
  ReplicaFailure  True    FailedCreate

Read more here.
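
As a rough sketch, the deadline is set directly on the Deployment spec (the name, image and deadline value below are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server                 # placeholder name
spec:
  progressDeadlineSeconds: 120     # report ProgressDeadlineExceeded after 2 minutes without progress
  replicas: 2
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      containers:
      - name: web
        image: nginx               # placeholder image

Once the Progressing condition flips to False with reason ProgressDeadlineExceeded, kubectl rollout status exits with a non-zero code, which can serve as the failure signal from a deploy script.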

Upvotes: 4

Symmetric

Reputation: 4733

There is no Kubernetes concept for a "failed" deployment. Editing a deployment registers your intent that the new ReplicaSet is to be created, and k8s will repeatedly try to make that intent happen. Any errors that are hit along the way will cause the rollout to block, but they will not cause k8s to abort the deployment.

AFAIK, the best you can do (as of 1.9) is to apply a deadline to the Deployment, which will add a Condition that you can detect when a deployment gets stuck; see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment and https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds.

It's possible to overlay your own definitions of failure on top of the statuses that k8s provides, but this is quite difficult to do in a generic way; see this issue for a (long!) discussion on the current status of this: https://github.com/kubernetes/kubernetes/issues/1899

Here's some Python code (using pykube) that I wrote a while ago that implements my own definition of ready; I abort my deploy script if this condition does not obtain after 5 minutes.

import logging

from pykube import Pod

_log = logging.getLogger(__name__)


def _is_deployment_ready(d, deployment):
    """Return True when `deployment` (a pykube Deployment) is fully rolled
    out and every Pod it selects is ready. `d` is expected to expose a
    pykube HTTPClient as `d.api` and the target namespace as `d.namespace`."""
    if not deployment.ready:
        _log.debug('Deployment not completed.')
        return False

    # More replicas than desired means old Pods are still being torn down.
    if deployment.obj["status"]["replicas"] > deployment.replicas:
        _log.debug('Old replicas not terminated.')
        return False

    # Find the Pods belonging to this Deployment via its label selector.
    selector = deployment.obj['spec']['selector']['matchLabels']
    pods = Pod.objects(d.api).filter(namespace=d.namespace, selector=selector)
    if not pods:
        _log.info('No pods found.')
        return False

    # The Deployment can look 'ready' before each Pod actually is, so check
    # the Pods individually as well.
    for pod in pods:
        _log.info('Is pod %s ready? %s.', pod.name, pod.ready)
        if not pod.ready:
            _log.debug('Pod status: %s', pod.obj['status'])
            return False
    _log.info('All pods ready.')
    return True

Note the individual pod check, which is required because a deployment seems to be deemed 'ready' when the rollout completes (i.e. all pods are created), not when all of the pods are ready.

Upvotes: 1
