EightyEight

Reputation: 3460

Server becomes unresponsive periodically, OOM Killer inactive?

I'm hosting a Ruby application in a Docker container on AWS. Unfortunately, this Ruby application is known to leak memory, so eventually it consumes all of the available memory.

I'm, perhaps naively, expecting the OOM killer to be invoked and kill the Ruby process, but nothing happens. Eventually the machine becomes unresponsive (the web server doesn't respond and SSH is unreachable). We force a restart of the machine from the AWS console and get the following in the messages log, so the machine is indeed alive at the time of the restart:

Apr 30 23:07:14 ip-10-0-10-24 init: serial (ttyS0) main process (2947) killed by TERM signal

I don't believe that this is resource exhaustion (i.e. running out of credits) in AWS. If I restart the application periodically, the server never goes down.

I'm very much at a loss here; why would memory pressure be causing machines to lock up?

Upvotes: 10

Views: 1839

Answers (2)

Josnidhin

Reputation: 12504

Apparently the solution I provided didn't help the person who asked the question, but it might help someone else who stumbles upon this. The following are the two things I suggested that might be causing the problem.

Suggestion 1

I am guessing you are using the official Ruby Docker image and that, when you run the container, Ruby is running as PID 1 inside the container.

If Ruby is running as PID 1, then the OOM killer won't be able to kill it, causing all the problems you are seeing.

To solve this problem you will have to make sure a proper init process runs as PID 1.

The docker run command has had the --init option since API version 1.25 (Docker 1.13). This option makes sure that a proper init handles the tasks of PID 1; it also forwards all signals to your Ruby application.

https://docs.docker.com/engine/reference/commandline/run/

--init API 1.25+ Run an init inside the container that forwards signals and reaps processes

The following is what Docker uses as the init: https://github.com/krallin/tini
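As a rough illustration of where the --init flag goes, here is a minimal Python sketch that only assembles the docker run argument list. The helper name docker_run_with_init, the image name ruby:latest and the script app.rb are placeholders I made up for the example, not anything from the question.

```python
def docker_run_with_init(image, command=()):
    """Build a `docker run` argv that asks Docker to run its init as PID 1.

    `image` and `command` are supplied by the caller; nothing is executed here.
    """
    # --init must come before the image name, like every other `docker run` option.
    return ["docker", "run", "--rm", "--init", image, *command]


if __name__ == "__main__":
    # "ruby:latest" and "app.rb" are hypothetical placeholders.
    argv = docker_run_with_init("ruby:latest", ["ruby", "app.rb"])
    print(" ".join(argv))
```

The resulting list can be passed to subprocess.run if you want to launch the container from a script. With --init, the Ruby process should no longer be PID 1 inside the container.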

Suggestion 2

There is a known issue with the Amazon Linux AMI; the details can be found at the following link: https://github.com/aws/amazon-ecs-agent/issues/794. As of writing, I am not sure whether the problem with the AMI has been fixed.

So try a different AMI, as suggested in that thread, for example the Ubuntu AMI.

Upvotes: 2

Darrell Plessas

Reputation: 211

I think you are assuming that the OOM killer will always target your Ruby application, but I don't think that is the case. Your log line shows it killed your tty connection instead. I am betting it is killing other processes before your Ruby process, and this is why your machine seems unresponsive. You can read up on how the OOM killer works, and it might help here. I would look specifically at your oom_scores and see what you find there.

http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
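One way to look at the oom_scores is to read them straight from /proc on the host. Below is a minimal Python sketch, assuming a Linux host where /proc/<pid>/oom_score and /proc/<pid>/comm are readable; the function name read_oom_scores and the proc_root parameter are my own names for illustration.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return (oom_score, pid, command) tuples, highest score first."""
    results = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directory names are processes
        try:
            score = int((entry / "oom_score").read_text().strip())
            comm = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # the process exited or the file was unreadable
        results.append((score, int(entry.name), comm))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    # Print the ten processes the kernel currently rates as the best OOM targets.
    for score, pid, comm in read_oom_scores()[:10]:
        print(f"{score:>6} {pid:>7} {comm}")
```

Processes near the top of the list are the ones the kernel would prefer to kill first. If the Ruby process is not among them, that would be consistent with other processes being killed before it.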

Good Luck

Upvotes: 1
