Reputation: 856
I need some advice on an issue I am facing with k8s 1.14 and running GitLab pipelines on it. Many jobs are failing with exit code 137 errors, and I found that it means the container is being terminated abruptly.
Cluster information:
Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: C5.4xLarge
After digging in, I found the below logs:
**kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
**kubelet: E0114 03:37:08.653132** 4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
**kubelet: W0114 03:37:23.240990** 4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
**kubelet: W0114 00:15:51.106881** 4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.106907** 4781 container_gc.go:85] attempting to delete unused containers
**kubelet: I0114 00:15:51.116286** 4781 image_gc_manager.go:317] attempting to delete unused images
**kubelet: I0114 00:15:51.130499** 4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.130648** 4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:
1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
And then the pods get terminated, resulting in the exit code 137 errors.
Can anyone help me understand the reason and a possible solution to overcome this?
Thank you :)
Upvotes: 39
Views: 161132
Reputation: 11
I also got 'command terminated with exit code 137' when running python3 from a pod. The problem was related to the antivirus, which was killing the process when Python script files were edited.
Upvotes: 1
Reputation: 1220
I encountered the problem: Last state: Terminated with 137: Error
and noticed that in the Recent Events there was a liveness probe failure: Liveness probe failed: Get "http://<ip:port>/actuator/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
So that means it restarted because of the health check failure, which happened because I was debugging the service and everything was blocked, including the health check endpoint :D
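For reference, here is a minimal sketch of how the probe timing could be relaxed so a temporarily blocked service is not killed right away. The container name, image, and port are placeholders; only the /actuator/health path comes from the error above:

```yaml
# Hypothetical container spec; name, image, and port are placeholders.
containers:
  - name: my-service
    image: my-service:latest
    livenessProbe:
      httpGet:
        path: /actuator/health   # the endpoint from the failing probe above
        port: 8080
      initialDelaySeconds: 30    # give the app time to start
      timeoutSeconds: 5          # how long before the probe hits "context deadline exceeded"
      failureThreshold: 3        # restart only after 3 consecutive failures
```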
Upvotes: 3
Reputation: 10398
Exit code 137 means the container received an external signal, SIGKILL, i.e. the process was killed forcibly. Hope this helps.
Upvotes: 14
Reputation: 1035
Exit code 137 does not necessarily mean OOMKilled. It indicates failure because the container received SIGKILL (some interrupt, or the 'oom-killer' [OUT-OF-MEMORY]).
If the pod got OOMKilled, you will see the lines below when you describe the pod:
State: Terminated
Reason: OOMKilled
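As a rough illustration (the values below are made up, not a recommendation), a container that exceeds its memory limit is sent SIGKILL by the kernel's OOM killer and ends up in the Terminated/OOMKilled state shown above:

```yaml
# Sketch only: illustrative values for a container's resources section.
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the container
  limits:
    memory: "512Mi"   # exceeding this gets the container OOMKilled (SIGKILL, exit 137)
```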
Edit on 2/2/2022
I see that you added **kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
and must evict pod(s) to reclaim ephemeral-storage
from the log. This usually happens when application pods write to disk, e.g. log files. Admins can configure when (at what disk usage %) eviction kicks in.
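For example, a sketch of the relevant KubeletConfiguration fields, assuming the 85%/80% image GC thresholds seen in the question's log and the usual default eviction signals (adjust the values to your environment):

```yaml
# Sketch of a KubeletConfiguration; the GC thresholds match the log in the
# question (85% high / 80% low), the eviction values are common defaults.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # image GC kicks in above this disk usage
imageGCLowThresholdPercent: 80    # image GC tries to bring usage back down to this
evictionHard:
  nodefs.available: "10%"         # evict pods when node filesystem free space drops below 10%
  imagefs.available: "15%"        # evict pods when image filesystem free space drops below 15%
```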
Upvotes: 51
Reputation: 11
Check the Jenkins master node's memory and CPU profile. In my case, the master was under high memory and CPU utilization, and the slaves were getting restarted with 137.
Upvotes: 0
Reputation: 241
137 means that k8s killed the container for some reason (maybe it didn't pass its liveness probe).
Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.
Upvotes: 24
Reputation: 31980
The typical causes for this error code are the system running out of RAM or a failed health check.
Upvotes: 11
Reputation: 856
I was able to solve the problem.
The nodes initially had a 20G EBS volume on a c5.4xlarge instance type. I increased the EBS to 50G and then 100G, but that did not help, as I kept seeing the below error:
"Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). "
I then changed the instance type to c5d.4xlarge, which has 400GB of instance (cache) storage, and gave it 300GB of EBS. This solved the error.
Some of the GitLab jobs were for Java applications that were eating up a lot of cache space and writing a lot of logs.
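Another hedged option (values below are made up) is to cap ephemeral-storage per runner pod, so a single log-heavy CI job is evicted on its own instead of filling the node's disk and pulling unrelated pods into the eviction:

```yaml
# Sketch only: illustrative values for a GitLab runner job pod's resources section.
resources:
  requests:
    ephemeral-storage: "2Gi"
  limits:
    ephemeral-storage: "10Gi"   # the pod is evicted if it writes more than this to node-local disk
```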
Upvotes: 11