Reputation: 948
I'm using the K3s distribution of Kubernetes which is deployed on a Spot EC2 Instance in AWS.
I have scheduled a certain processing job, and sometimes this job is terminated and ends up in the "Unknown" state (the job code is abnormally terminated).
Running kubectl describe pod <pod_name>
shows this:
State: Terminated
Reason: Unknown
Exit Code: 255
Started: Wed, 06 Jan 2021 21:13:29 +0000
Finished: Wed, 06 Jan 2021 23:33:46 +0000
The AWS logs show that CPU consumption was at 99% right before the crash. From a number of sources (1, 2, 3) I saw that this can be a reason for a node crash, but I couldn't confirm that this is what happened here. What may be the reason?
Thanks!
Upvotes: 3
Views: 2009
Reputation: 13878
The actual state of the Job is Terminated
with the reason Unknown
. In order to debug this situation, you need to get the relevant logs from the Pods created by your Job.
When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output.
To do so, execute kubectl describe job $JOB
to see the Pods' names under the Events section, and then execute kubectl logs $POD
.
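As a minimal sketch (the Job name processing-job is a placeholder for your own Job), the sequence could look like this:

# list the Pods created by the Job under the Events section
kubectl describe job processing-job
# alternatively, find the Pods via the job-name label that Jobs add automatically
kubectl get pods --selector=job-name=processing-job
# view the logs of one of those Pods
kubectl logs <pod_name>
# if the container was restarted, the previous instance's logs may hold the error
kubectl logs <pod_name> --previous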
If that won't be enough, you can try different ways to Debug Pods, such as:
Debugging with container exec
Debugging with an ephemeral debug container, or
Debugging via a shell on the node
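A rough sketch of those three approaches (pod, container, and node names are placeholders, and kubectl debug assumes a kubectl/cluster version recent enough to support ephemeral debug containers):

# exec into a still-running container of the Pod
kubectl exec -it <pod_name> -- /bin/sh
# attach an ephemeral debug container to the Pod
kubectl debug -it <pod_name> --image=busybox --target=<container_name>
# open a debugging shell on the node itself
kubectl debug node/<node_name> -it --image=ubuntu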
The methods above will give you more info regarding the actual reasons behind the Job termination.
Upvotes: 2