Reputation: 6759
I have the following setup:
A Docker image omg/telperion on Docker Hub
A Kubernetes cluster (4 nodes, each with ~50 GB RAM) and plenty of resources
I followed tutorials to pull images from Docker Hub into Kubernetes:
SERVICE_NAME=telperion
DOCKER_SERVER="https://index.docker.io/v1/"
DOCKER_USERNAME=username
DOCKER_PASSWORD=password
DOCKER_EMAIL="[email protected]"
# Create secret
kubectl create secret docker-registry dockerhub --docker-server=$DOCKER_SERVER --docker-username=$DOCKER_USERNAME --docker-password=$DOCKER_PASSWORD --docker-email=$DOCKER_EMAIL
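# Optionally verify the secret was created (the name "dockerhub" comes from the command above)
kubectl describe secret dockerhub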
# Create service yaml
echo "apiVersion: v1 \n\
kind: Pod \n\
metadata: \n\
name: ${SERVICE_NAME} \n\
spec: \n\
containers: \n\
- name: ${SERVICE_NAME} \n\
image: omg/${SERVICE_NAME} \n\
imagePullPolicy: Always \n\
command: [ \"echo\",\"done deploying $SERVICE_NAME\" ] \n\
imagePullSecrets: \n\
- name: dockerhub" > $SERVICE_NAME.yaml
# Deploy to kubernetes
kubectl create -f $SERVICE_NAME.yaml
This results in the pod going into a CrashLoopBackOff.
docker run -it -p8080:9546 omg/telperion
works fine.
So my question is: is this debuggable, and if so, how do I debug it?
Some logs:
kubectl get nodes
NAME                    STATUS                     AGE       VERSION
k8s-agent-adb12ed9-0    Ready                      22h       v1.6.6
k8s-agent-adb12ed9-1    Ready                      22h       v1.6.6
k8s-agent-adb12ed9-2    Ready                      22h       v1.6.6
k8s-master-adb12ed9-0   Ready,SchedulingDisabled   22h       v1.6.6
.
kubectl get pods
NAME        READY     STATUS             RESTARTS   AGE
telperion   0/1       CrashLoopBackOff   10         28m
.
kubectl describe pod telperion
Name: telperion
Namespace: default
Node: k8s-agent-adb12ed9-2/10.240.0.4
Start Time: Wed, 21 Jun 2017 10:18:23 +0000
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.244.1.4
Controllers: <none>
Containers:
  telperion:
    Container ID:    docker://c2dd021b3d619d1d4e2afafd7a71070e1e43132563fdc370e75008c0b876d567
    Image:           omg/telperion
    Image ID:        docker-pullable://omg/telperion@sha256:c7e3beb0457b33cd2043c62ea7b11ae44a5629a5279a88c086ff4853828a6d96
    Port:
    Command:
      echo
      done deploying telperion
    State:           Waiting
      Reason:        CrashLoopBackOff
    Last State:      Terminated
      Reason:        Completed
      Exit Code:     0
      Started:       Wed, 21 Jun 2017 10:19:25 +0000
      Finished:      Wed, 21 Jun 2017 10:19:25 +0000
    Ready:           False
    Restart Count:   3
    Environment:     <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-n7ll0 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-n7ll0:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-n7ll0
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 default-scheduler Normal Scheduled Successfully assigned telperion to k8s-agent-adb12ed9-2
1m 1m 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Created Created container with id d9aa21fd16b682698235e49adf80366f90d02628e7ed5d40a6e046aaaf7bf774
1m 1m 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Started Started container with id d9aa21fd16b682698235e49adf80366f90d02628e7ed5d40a6e046aaaf7bf774
1m 1m 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Started Started container with id c6c8f61016b06d0488e16bbac0c9285fed744b933112fd5d116e3e41c86db919
1m 1m 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Created Created container with id c6c8f61016b06d0488e16bbac0c9285fed744b933112fd5d116e3e41c86db919
1m 1m 2 kubelet, k8s-agent-adb12ed9-2 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "telperion" with CrashLoopBackOff: "Back-off 10s restarting failed container=telperion pod=telperion_default(f4e36a12-566a-11e7-99a6-000d3aa32f49)"
1m 1m 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Started Started container with id 3b911f1273518b380bfcbc71c9b7b770826c0ce884ac876fdb208e7c952a4631
1m 1m 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Created Created container with id 3b911f1273518b380bfcbc71c9b7b770826c0ce884ac876fdb208e7c952a4631
1m 1m 2 kubelet, k8s-agent-adb12ed9-2 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "telperion" with CrashLoopBackOff: "Back-off 20s restarting failed container=telperion pod=telperion_default(f4e36a12-566a-11e7-99a6-000d3aa32f49)"
1m 50s 4 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Pulling pulling image "omg/telperion"
47s 47s 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Started Started container with id c2dd021b3d619d1d4e2afafd7a71070e1e43132563fdc370e75008c0b876d567
1m 47s 4 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Pulled Successfully pulled image "omg/telperion"
47s 47s 1 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Normal Created Created container with id c2dd021b3d619d1d4e2afafd7a71070e1e43132563fdc370e75008c0b876d567
1m 9s 8 kubelet, k8s-agent-adb12ed9-2 spec.containers{telperion} Warning BackOff Back-off restarting failed container
46s 9s 4 kubelet, k8s-agent-adb12ed9-2 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "telperion" with CrashLoopBackOff: "Back-off 40s restarting failed container=telperion pod=telperion_default(f4e36a12-566a-11e7-99a6-000d3aa32f49)"
Edit 1: Errors reported by kubelet on master:
journalctl -u kubelet
.
Jun 21 10:28:49 k8s-master-ADB12ED9-0 docker[1622]: E0621 10:28:49.798140 1809 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /var/lib/docker/overlay/5cfff16d670f2df6520360595d7858fb5d16607b6999a88e5dcbc09e1e7ab9ce with output
Jun 21 10:28:49 k8s-master-ADB12ED9-0 docker[1622]: , stderr: du: cannot access '/var/lib/docker/overlay/5cfff16d670f2df6520360595d7858fb5d16607b6999a88e5dcbc09e1e7ab9ce/merged/proc/13122/task/13122/fd/4': No such file or directory
Jun 21 10:28:49 k8s-master-ADB12ED9-0 docker[1622]: du: cannot access '/var/lib/docker/overlay/5cfff16d670f2df6520360595d7858fb5d16607b6999a88e5dcbc09e1e7ab9ce/merged/proc/13122/task/13122/fdinfo/4': No such file or directory
Jun 21 10:28:49 k8s-master-ADB12ED9-0 docker[1622]: du: cannot access '/var/lib/docker/overlay/5cfff16d670f2df6520360595d7858fb5d16607b6999a88e5dcbc09e1e7ab9ce/merged/proc/13122/fd/3': No such file or directory
Jun 21 10:28:49 k8s-master-ADB12ED9-0 docker[1622]: du: cannot access '/var/lib/docker/overlay/5cfff16d670f2df6520360595d7858fb5d16607b6999a88e5dcbc09e1e7ab9ce/merged/proc/13122/fdinfo/3': No such file or directory
Jun 21 10:28:49 k8s-master-ADB12ED9-0 docker[1622]: - exit status 1, rootInodeErr: <nil>, extraDiskErr: <nil>
Edit 2: more logs
kubectl logs $SERVICE_NAME -p
done deploying telperion
Upvotes: 83
Views: 97064
Reputation: 65
Today I had an unusual crash that produced no "previous" logs. I was copying a running pod, killing it, and then trying to relaunch it so that I could make some minor changes.
metadata:
  labels:
    pod-template-hash: 6ccd9955f5
When you copy a pod you typically have to delete the creationTimestamp, uid, resourceVersion, ownerReferences, and a few other automatically generated fields. Well, I forgot to delete the pod-template-hash label. When a pod carries the wrong pod-template-hash, it begins to start and then disappears without a trace: no debug logs, describe pod doesn't work, nil, nothing, nada. So this is a potential source of phantom crashes in your pod.
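A rough sketch of the workflow, with hypothetical pod and file names:
# Dump the running pod, then strip the server-generated fields before re-creating it
kubectl get pod my-pod -o yaml > my-pod-copy.yaml
# In my-pod-copy.yaml delete at least: metadata.uid, metadata.resourceVersion,
# metadata.creationTimestamp, metadata.ownerReferences, status,
# and the metadata.labels.pod-template-hash label
kubectl create -f my-pod-copy.yaml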
Upvotes: 0
Reputation: 6387
I wrote an open source tool to do this (robusta.dev). It sends the output to Slack (and can send to other destinations too).
More or less it watches pod updates, detects containers in a restart loop, fetches the logs of the previous (crashed) run, and forwards everything to the configured destination.
I'm simplifying the code a little, but here is the Python code that implements the above:
@action
def restart_loop_reporter(event: PodEvent, config: RestartLoopParams):
    """
    When a pod is in a restart loop, debug the issue, fetch the logs, and send useful information on the restart
    """
    pod = event.get_pod()
    crashed_container_statuses = get_crashing_containers(pod.status, config)

    # this callback runs on every pod update, so filter out pods without crashing containers
    if len(crashed_container_statuses) == 0:
        return  # no matched containers

    pod_name = pod.metadata.name
    # don't run this too frequently for the same crashing pod
    if not RateLimiter.mark_and_test(
        "restart_loop_reporter", pod_name + pod.metadata.namespace, config.rate_limit
    ):
        return

    # this is the data we send to Slack / other destinations
    blocks: List[BaseBlock] = []
    for container_status in crashed_container_statuses:
        blocks.append(
            MarkdownBlock(
                f"*{container_status.name}* restart count: {container_status.restartCount}"
            )
        )
        ...
        container_log = pod.get_logs(container_status.name, previous=True)
        if container_log:
            blocks.append(FileBlock(f"{pod_name}.txt", container_log))
        else:
            blocks.append(
                MarkdownBlock(
                    f"Container logs unavailable for container: {container_status.name}"
                )
            )

    event.add_enrichment(blocks)
You can see the actual code, without my simplifications, on GitHub.
Upvotes: -1
Reputation: 804
CrashLoopBackOff means that the pod crashes right after it starts. Kubernetes tries to start the pod again, it crashes again, and this goes on in a loop.
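For example, you can watch the restart count climb (the pod name and namespace are placeholders):
kubectl get pod <pod-name> -n <namespace> -w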
You can check the pod's logs for errors with:
kubectl logs <pod-name> -n <namespace> --previous
The --previous flag shows the logs of the previous instantiation of the container.
Next, check the State/Reason, Last State/Reason, and Events sections by describing the pod: kubectl describe pod <pod-name> -n <namespace>.
Sometimes the issue is caused by too little memory or CPU being given to the application.
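If you suspect resource starvation, a couple of quick checks (pod name and namespace are placeholders):
# Was the last restart an OOM kill? Prints the termination reason of each container's last state.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# What requests/limits does the pod actually have?
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'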
Upvotes: 12
Reputation: 4972
You can access the logs of your pods with
kubectl logs [podname] -p
The -p option reads the logs of the previous (crashed) instance of the container.
If the crash comes from the application, you should have useful logs in there.
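If the pod runs more than one container, add -c to pick one (the names in brackets are placeholders):
kubectl logs [podname] -p -c [containername]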
Upvotes: 129