YYashwanth

Reputation: 856

Kubernetes Pods Terminated - Exit Code 137

I need some advice on an issue I am facing with k8s 1.14 and running GitLab pipelines on it. Many jobs are failing with exit code 137 errors, and I found that it means the container is being terminated abruptly.


Cluster information:

Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: c5.4xlarge


After digging in, I found the below logs:

**kubelet: I0114 03:37:08.639450**  4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).

**kubelet: E0114 03:37:08.653132**  4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes

**kubelet: W0114 03:37:23.240990**  4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up

**kubelet: W0114 00:15:51.106881**   4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage

**kubelet: I0114 00:15:51.106907**   4781 container_gc.go:85] attempting to delete unused containers

**kubelet: I0114 00:15:51.116286**   4781 image_gc_manager.go:317] attempting to delete unused images

**kubelet: I0114 00:15:51.130499**   4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage 

**kubelet: I0114 00:15:51.130648**   4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:

 1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
 2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
 3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
 4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
 5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)

And then the pods get terminated, resulting in the exit code 137 errors.
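For reference, this is roughly how I have been confirming the disk pressure on the nodes (the node name is a placeholder, and the paths assume a Docker-based EKS worker):

    # Check whether the kubelet is reporting the DiskPressure condition
    kubectl describe node <node-name> | grep -A 8 "Conditions:"

    # On the worker itself, check how full the image/container filesystems are
    df -h /var/lib/docker /var/lib/kubelet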

Can anyone help me understand the reason and a possible solution to overcome this?

Thank you :)

Upvotes: 39

Views: 161132

Answers (8)

Eti Tocatly

Reputation: 11

I also got 'command terminated with exit code 137' when running python3 from a pod. The problem was caused by an antivirus that was killing the process when the Python script files were edited.

Upvotes: 1

puppylpg

Reputation: 1220

I encountered the problem Last state: Terminated with 137: Error and noticed in the recent events a liveness probe failure: Liveness probe failed: Get "http://<ip:port>/actuator/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers). That means the container was restarted because the health check failed, which happened because I was debugging the service and everything was blocked, including the health check endpoint :D
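For anyone hitting the same thing, the relevant knobs are on the probe itself. A minimal sketch of the container spec fragment (the port and timings are examples, not taken from my deployment):

    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 8080            # example port
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5       # "context deadline exceeded" means this timeout was hit
      failureThreshold: 3     # after this many consecutive failures the container is restarted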

Upvotes: 3

Gupta

Reputation: 10398

Exit code 137 in detail:

  1. It denotes that the process was terminated by an external signal.
  2. The number 137 is the sum 128 + x, where x is the signal number sent to the process that caused it to terminate.
  3. Here, x equals 9, which is the number of the SIGKILL signal, meaning the process was killed forcibly.
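A quick way to check the arithmetic on any Linux box:

    # 137 - 128 = 9, and signal 9 is SIGKILL
    $ echo $((137 - 128))
    9
    $ kill -l 9
    KILL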

Hope this helps.

Upvotes: 14

ffran09

Reputation: 1035

Exit code 137 does not necessarily mean OOMKilled. It indicates that the container received a SIGKILL (from some interrupt or from the 'oom-killer' [OUT OF MEMORY]).

If the pod was OOMKilled, you will see the lines below when you describe the pod:

      State:        Terminated
      Reason:       OOMKilled

Edit on 2/2/2022: I see that you added **kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). and must evict pod(s) to reclaim ephemeral-storage from the log. This usually happens when application pods write something to disk, like log files. Admins can configure at what disk usage percentage eviction is triggered.
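If the eviction is driven by the workload's own disk writes (logs, caches), one option, sketched below with placeholder values, is to declare ephemeral-storage requests and limits on the offending containers so the kubelet can account for their disk usage and evict the pod that actually exceeds its limit:

    # container spec fragment (values are examples)
    resources:
      requests:
        ephemeral-storage: "1Gi"
      limits:
        ephemeral-storage: "4Gi"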

Upvotes: 51

Check Jenkins's master node memory and CPU profile. In my case, the master was under high memory and CPU utilization, and the slaves were getting restarted with 137.
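A quick way to check that (requires metrics-server; the namespace is a placeholder):

    kubectl top nodes
    kubectl top pods -n <jenkins-namespace>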

Upvotes: 0

werewolf

Reputation: 241

137 means that k8s killed the container for some reason (maybe it did not pass its liveness probe).

Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.
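To confirm the reason and exit code of the last termination (the pod name is a placeholder):

    kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'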

Upvotes: 24

Chris Halcrow

Reputation: 31980

The typical causes of this error code are the system running out of RAM, or a failed health check.
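A couple of commands that help tell the two apart (the pod name is a placeholder):

    # OOM kills show up as Reason: OOMKilled in the last state
    kubectl describe pod <pod-name> | grep -A 5 "Last State"

    # Probe failures and evictions show up in the events
    kubectl get events --sort-by=.lastTimestamp | grep -iE "oomkill|liveness|evict"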

Upvotes: 11

YYashwanth

Reputation: 856

I was able to solve the problem.

The nodes initially had 20 GB of EBS volume on a c5.4xlarge instance type. I increased the EBS to 50 GB and then 100 GB, but that did not help, as I kept seeing the below error:

"Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%)."

I then changed the instance type to c5d.4xlarge, which has 400 GB of local cache storage, and gave it 300 GB of EBS. This solved the error.

Some of the GitLab jobs were for Java applications that were eating up a lot of cache space and writing a lot of logs.
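As a complementary (or alternative) mitigation, the kubelet thresholds from the log lines above can also be tuned. A sketch for an EKS worker, assuming the 1.14-era EKS AMI's bootstrap script; the cluster name and percentages are placeholders:

    /etc/eks/bootstrap.sh <cluster-name> \
      --kubelet-extra-args '--image-gc-high-threshold=75 --image-gc-low-threshold=65 --eviction-hard=imagefs.available<15%,nodefs.available<10%'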

Upvotes: 11
