Reputation: 945
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
I have a Kubernetes crobjob that serves the purpose of running some Azure cli commands on a time based schedule.
Running the container locally works fine, however, manually triggering the Cronjob through Lens, or letting it run per the schedule results in weird behaviour (Running in the cloud as a job yeilds unexpected results).
Here is the cronjob definition:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: development-scale-down
namespace: development
spec:
schedule: "0 22 * * 0-4"
concurrencyPolicy: Allow
startingDeadlineSeconds: 60
failedJobsHistoryLimit: 5
jobTemplate:
spec:
backoffLimit: 0 # Do not retry
activeDeadlineSeconds: 360 # 5 minutes
template:
spec:
containers:
- name: scaler
image: myimage:latest
imagePullPolicy: Always
env: ...
restartPolicy: "Never"
I ran the cronjob manually and it created job development-scale-down-manual-xwp1k
. Describing this job after it completed, we can see the following:
$ kubectl describe job development-scale-down-manual-xwp1k
Name: development-scale-down-manual-xwp1k
Namespace: development
Selector: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Annotations: <none>
Parallelism: 1
Completions: 1
Start Time: Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds: 360s
Pods Statuses: 0 Running / 0 Succeeded / 1 Failed
Pod Template:
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Containers:
scaler:
Image: myimage:latest
Port: <none>
Host Port: <none>
Environment:
CLUSTER_NAME: ...
NODEPOOL_NAME: ...
NODEPOOL_SIZE: ...
RESOURCE_GROUP: ...
SP_APP_ID: <set to the key 'application_id' in secret 'scaler-secrets'> Optional: false
SP_PASSWORD: <set to the key 'application_pass' in secret 'scaler-secrets'> Optional: false
SP_TENANT: <set to the key 'application_tenant' in secret 'scaler-secrets'> Optional: false
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 24m job-controller Created pod: development-scale-down-manual-xwp1k-b858c
Normal SuccessfulCreate 23m job-controller Created pod: development-scale-down-manual-xwp1k-xkkw9
Warning BackoffLimitExceeded 23m job-controller Job has reached the specified backoff limit
This differs from other issues I have read, where it does not mention a "SuccessfulDelete" event.
The events received from kubectl get events
tell an interesting story
$ ktl get events | grep xwp1k
3m19s Normal Scheduled pod/development-scale-down-manual-xwp1k-b858c Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s Normal Pulling pod/development-scale-down-manual-xwp1k-b858c Pulling image "myimage:latest"
2m38s Normal Pulled pod/development-scale-down-manual-xwp1k-b858c Successfully pulled image "myimage:latest" in 40.365655229s
2m23s Normal Created pod/development-scale-down-manual-xwp1k-b858c Created container myimage
2m23s Normal Started pod/development-scale-down-manual-xwp1k-b858c Started container myimage
2m12s Normal Killing pod/development-scale-down-manual-xwp1k-b858c Stopping container myimage
2m12s Normal Scheduled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s Normal Pulling pod/development-scale-down-manual-xwp1k-xkkw9 Pulling image "myimage:latest"
2m11s Normal Pulled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully pulled image "myimage:latest" in 751.93652ms
2m10s Normal Created pod/development-scale-down-manual-xwp1k-xkkw9 Created container myimage
2m10s Normal Started pod/development-scale-down-manual-xwp1k-xkkw9 Started container myimage
3m19s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-b858c
2m12s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s Warning BackoffLimitExceeded job/development-scale-down-manual-xwp1k Job has reached the specified backoff limit
I cant figure out why the container was killed, the logs all seem fine and there are no resource constraints. The container is removed very quickly meaning I have very little time to debug. The more verbose event line reads as such
3m54s Normal Killing pod/development-scale-down-manual-xwp1k-b858c spec.containers{myimage} kubelet, aks-burst-37275452-vmss00000d Stopping container myimage 3m54s 1 development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
I note that the image pull takes a good few seconds (40) initially, might this aid in exceeding the startingDeadline or another cron spec?
Any thoughts or help appreciated, thank you
Upvotes: 0
Views: 4956
Reputation: 1
We faced the same problem in our infrastructure.
In that case the root cause was the cluster autoscaling that was killing and deleting jobs in order to scale down the cluster and free one (or more) nodes.
In that case we solve the issue using the "safe-to-evic" annotation that prevent k8s to kill the job due to the autoscaler.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
In our case, we blamed the autoscaler just after had a look to the k8s 'events' of that namespace.
Upvotes: 0
Reputation: 945
Reading logs! Always helpful.
For context, the job itself scales an AKS nodepool. We have two, the default system
one, and a new user controlled one. The cronjob is meant to scale the new user
(Not system
pool).
I noticed that the scale-down
job always takes longer compared to the scale-up
job, this is due to the image pull always happening when the scale down job runs.
I also noticed that the Killing
event mentioned above originates from the kubelet. (kubectl get events -o wide
)
I went to check the kubelet logs on the host, and realised that the host name was a little atypical (aks-burst-XXXXXXXX-vmss00000d
) in the sense that most hosts in our small development cluster usually has numbers on the end, not d
There I realised the naming was different because this node was not part of the default nodepool, and I could not check the kubelet logs because the host had been removed.
The job scales down compute resources. The scale down would fail, because it was always preceeded by a scale up, in which point a new node was in the cluster. This node had nothing running on it, so the next Job was scheduled on it. The Job started on the new node, told Azure to scale down the new node to 0, and subsequently the Kubelet killed the job as it was running.
Always being scheduled on the new node explains why the image pull happened each time as well.
I changed the spec and added a NodeSelector so that the Job would always run on the system
pool, which is more stable than the user
pool
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: development-scale-down
namespace: development
spec:
schedule: "0 22 * * 0-4"
concurrencyPolicy: Allow
startingDeadlineSeconds: 60
failedJobsHistoryLimit: 5
jobTemplate:
spec:
backoffLimit: 0 # Do not retry
activeDeadlineSeconds: 360 # 5 minutes
template:
spec:
containers:
- name: scaler
image: myimage:latest
imagePullPolicy: Always
env: ...
restartPolicy: "Never"
nodeSelector:
agentpool: default
Upvotes: 2