TheQueenIsDead

Reputation: 945

Why does my Kubernetes Cronjob pod get killed while executing?

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

I have a Kubernetes CronJob that serves the purpose of running some Azure CLI commands on a time-based schedule.

Running the container locally works fine; however, manually triggering the CronJob through Lens, or letting it run on its schedule, results in strange behaviour (running it in the cloud as a Job yields unexpected results).

Here is the CronJob definition:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
          - name: scaler
            image: myimage:latest
            imagePullPolicy: Always
            env: ...
          restartPolicy: "Never"

I ran the CronJob manually and it created the job development-scale-down-manual-xwp1k. Describing this job after it completed, we can see the following:

$ kubectl describe job development-scale-down-manual-xwp1k
Name:                     development-scale-down-manual-xwp1k
Namespace:                development
Selector:                 controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels:                   controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
                          job-name=development-scale-down-manual-xwp1k
Annotations:              <none>
Parallelism:              1
Completions:              1
Start Time:               Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds:  360s
Pods Statuses:            0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:  controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
           job-name=development-scale-down-manual-xwp1k
  Containers:
   scaler:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Environment:
      CLUSTER_NAME:    ...
      NODEPOOL_NAME:   ...
      NODEPOOL_SIZE:   ...
      RESOURCE_GROUP:  ...
      SP_APP_ID:       <set to the key 'application_id' in secret 'scaler-secrets'>      Optional: false
      SP_PASSWORD:     <set to the key 'application_pass' in secret 'scaler-secrets'>    Optional: false
      SP_TENANT:       <set to the key 'application_tenant' in secret 'scaler-secrets'>  Optional: false
    Mounts:            <none>
  Volumes:             <none>
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      24m   job-controller  Created pod: development-scale-down-manual-xwp1k-b858c
  Normal   SuccessfulCreate      23m   job-controller  Created pod: development-scale-down-manual-xwp1k-xkkw9
  Warning  BackoffLimitExceeded  23m   job-controller  Job has reached the specified backoff limit

This differs from other issues I have read about, in that no "SuccessfulDelete" event is mentioned here.

The events received from kubectl get events tell an interesting story:

$ kubectl get events | grep xwp1k
3m19s       Normal    Scheduled                  pod/development-scale-down-manual-xwp1k-b858c   Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s       Normal    Pulling                    pod/development-scale-down-manual-xwp1k-b858c   Pulling image "myimage:latest"
2m38s       Normal    Pulled                     pod/development-scale-down-manual-xwp1k-b858c   Successfully pulled image "myimage:latest" in 40.365655229s
2m23s       Normal    Created                    pod/development-scale-down-manual-xwp1k-b858c   Created container myimage
2m23s       Normal    Started                    pod/development-scale-down-manual-xwp1k-b858c   Started container myimage
2m12s       Normal    Killing                    pod/development-scale-down-manual-xwp1k-b858c   Stopping container myimage
2m12s       Normal    Scheduled                  pod/development-scale-down-manual-xwp1k-xkkw9   Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s       Normal    Pulling                    pod/development-scale-down-manual-xwp1k-xkkw9   Pulling image "myimage:latest"
2m11s       Normal    Pulled                     pod/development-scale-down-manual-xwp1k-xkkw9   Successfully pulled image "myimage:latest" in 751.93652ms
2m10s       Normal    Created                    pod/development-scale-down-manual-xwp1k-xkkw9   Created container myimage
2m10s       Normal    Started                    pod/development-scale-down-manual-xwp1k-xkkw9   Started container myimage
3m19s       Normal    SuccessfulCreate           job/development-scale-down-manual-xwp1k         Created pod: development-scale-down-manual-xwp1k-b858c
2m12s       Normal    SuccessfulCreate           job/development-scale-down-manual-xwp1k         Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s        Warning   BackoffLimitExceeded       job/development-scale-down-manual-xwp1k         Job has reached the specified backoff limit

I can't figure out why the container was killed; the logs all seem fine and there are no resource constraints. The container is removed very quickly, meaning I have very little time to debug. The more verbose event line reads as follows:

3m54s       Normal    Killing                    pod/development-scale-down-manual-xwp1k-b858c   spec.containers{myimage}   kubelet, aks-burst-37275452-vmss00000d                                 Stopping container myimage                                                                                                                                                       3m54s        1       development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
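
For reference, the wide event output above can be reproduced with something like the following (the grep filter is only illustrative):

$ kubectl get events -o wide | grep Killing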

I note that the initial image pull takes a good few seconds (40); could this contribute to exceeding the startingDeadlineSeconds or another part of the cron spec?

Any thoughts or help are appreciated, thank you.

Upvotes: 0

Views: 4956

Answers (2)

Davide Giudici

Reputation: 1

We faced the same problem in our infrastructure.

In our case, the root cause was the cluster autoscaler, which was killing and deleting jobs in order to scale the cluster down and free one (or more) nodes.

We solved the issue using the "safe-to-evict" annotation, which prevents the autoscaler from evicting the pod (a sketch of where it goes is shown below the FAQ link).

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
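
As an illustration (not the asker's exact spec), a minimal sketch of where the annotation would go in the CronJob from the question; the annotation name comes from the autoscaler FAQ linked above:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # Tell the cluster autoscaler not to evict this pod when scaling down
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        spec:
          containers:
          - name: scaler
            image: myimage:latest
          restartPolicy: "Never"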

In our case, we were able to pin the blame on the autoscaler right after having a look at the Kubernetes events in that namespace.

Upvotes: 0

TheQueenIsDead

Reputation: 945

Reading logs! Always helpful.

Context

For context, the job itself scales an AKS node pool. We have two: the default system pool and a new user-controlled one. The CronJob is meant to scale the new user pool (not the system pool).

Investigating

I noticed that the scale-down job always takes longer than the scale-up job; this is because an image pull always happens when the scale-down job runs.

I also noticed that the Killing event mentioned above originates from the kubelet (kubectl get events -o wide).

I went to check the kubelet logs on the host and realised that the host name was a little atypical (aks-burst-XXXXXXXX-vmss00000d), in the sense that most hosts in our small development cluster end with a number, not a d.

I then realised the naming was different because this node was not part of the default node pool, and I could not check the kubelet logs because the host had already been removed.

Cause

The job scales down compute resources. The scale-down would fail because it was always preceded by a scale-up, at which point a new node was in the cluster. This new node had nothing running on it, so the next Job was scheduled onto it. The Job started on the new node, told Azure to scale the new node pool down to 0, and the kubelet subsequently killed the Job's pod while it was still running.

Always being scheduled on the new node also explains why the image pull happened each time.
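
For illustration, the scale-down the container performs is presumably something along these lines (the exact script is not shown in the question; the environment variable names are taken from the Job description above, and scaling to 0 nodes is an assumption):

$ az aks nodepool scale \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODEPOOL_NAME" \
    --node-count 0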

Fix

I changed the spec and added a nodeSelector so that the Job always runs on the system pool, which is more stable than the user pool:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
          - name: scaler
            image: myimage:latest
            imagePullPolicy: Always
            env: ...
          restartPolicy: "Never"
          nodeSelector:
            agentpool: default
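
To double-check which pool each node belongs to before relying on the selector, the agentpool label that AKS applies to its nodes (the same label used in the nodeSelector above) can be listed per node:

$ kubectl get nodes -L agentpool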

Upvotes: 2
