Reputation: 1036
I'm trying to monitor unicorn workers with monit, so it gracefully kills them when they reach a certain memory threshold.
The problem:
When I tell monit to restart a worker, it first tries to stop it by firing my /etc/init.d/unicorn_orly kill_worker 0 script command.
# my /etc/monit/config.d/unicorn file
check process orly_unicorn_worker_0 with pidfile /tmp/unicorn.orly.0.pid
  start program = "/bin/true"
  stop program = "/etc/init.d/unicorn_orly kill_worker 0"
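(For context, the memory trigger itself is just a monit resource test in the same check block; something along these lines, where the 300 MB limit and cycle count are purely illustrative:)
  if totalmem > 300.0 MB for 2 cycles then restart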
Watching the processes with the top command, I can see the worker being killed and the master spawning a new worker with, of course, another PID.
Monit, however, waits for a while and then throws a "failed to stop" error in its log; it is actually waiting 30 seconds and timing out. Once it times out, monit considers the restart action done, then notices that the worker PID has changed, and continues to monitor the process as expected.
As a result everything works: monit is able to restart a worker when needed and keeps monitoring it. But the log is full of errors, the web interface shows a nasty (and confusing) "execution failed" status on the worker, and I guess it would send erroneous email alerts if those were set up.
This is the relevant part of the log when I try to restart a worker through the web interface (notice how it also gets confused about the worker's parent PID):
[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' failed to stop
[UTC Mar 5 13:29:47] info : 'orly_unicorn_worker_0' restart action done
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' process PID changed to 13699
[UTC Mar 5 13:29:49] error : 'orly_unicorn_worker_0' process PPID changed to 0
[UTC Mar 5 13:30:19] info : 'orly_unicorn_worker_0' process PID has not changed since last cycle
[UTC Mar 5 13:30:19] error : 'orly_unicorn_worker_0' process PPID changed to 13660
[UTC Mar 5 13:30:49] info : 'orly_unicorn_worker_0' process PPID has not changed since last cycle
It took me a long time to figure out, but what's happening here is that the worker gets killed and then respawned so quickly that monit doesn't even notice the change.
My guess is that monit, when performing the stop action, reads /tmp/unicorn.orly.0.pid to get the PID of the process and then checks whether that process still exists.
However, since the kill-and-respawn operation happens so fast, monit doesn't realize that the worker's PID has changed and keeps waiting for the (brand new) worker to die. It then times out, finally realizes the PID has actually changed, and everything goes on as normal.
The dirty solution I have found:
To prove this hypothesis I tried to slow down that kill-and-respawn operation, so I edited the unicorn config file to make new workers sleep a few seconds just before they write their new PID to /tmp/unicorn.orly.0.pid.
after_fork do |server, worker|
  sleep 3
  # write down the new worker PID so monit can monitor it
  child_pid = server.config[:pid].sub(".pid", ".#{worker.nr}.pid")
  system("echo #{Process.pid} > #{child_pid}")
end
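(As an aside, the shell-out isn't essential for the pidfile write; the same hook could write the file directly in plain Ruby, along these lines:)
after_fork do |server, worker|
  sleep 3
  # write down the new worker PID so monit can monitor it
  child_pid = server.config[:pid].sub(".pid", ".#{worker.nr}.pid")
  File.open(child_pid, "w") { |f| f.puts Process.pid }
end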
And it works wonderfully: birds sing and flowers bloom in the sunny day, the web interface now shows a nice "process running" status, and the logs show everything going smoothly. Take a look:
[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' stopped
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' start: /bin/true
[UTC Mar 5 13:30:46] info : 'orly_unicorn_worker_0' restart action done
The question:
Is there a monit-way of achieving this? Sleeping my workers for 3 seconds doesn't seem like a good solution. Any ideas?
I understand this is not the normal situation for monit. We have somewhat broken monit's restart cycle, since we don't want monit's start program to perform any action; instead we let the unicorn master process handle the respawn (as explained here: http://www.stopdropandrew.com/2010/06/01/where-unicorns-go-to-die-watching-unicorn-workers-with-monit.html).
Upvotes: 2
Views: 1075
Reputation: 1
In our environment, monit monitors the unicorn master, and the unicorn master monitors its children. We use a simple cron job to monitor the unicorn workers, killing them if they exceed a memory threshold:
#!/usr/bin/env ruby

# Return the resident memory (RSS) of the given PID in megabytes, read from /proc.
def get_mem(pid)
  pid = pid.to_i
  mem = 0
  if File.exist?("/proc/#{pid}/status")
    File.read("/proc/#{pid}/status").each_line do |status|
      next unless status =~ /^VmRSS:\s+(\d+) kb/i
      mem = $1.to_i / 1024
    end
  end
  mem
end

# Gracefully kill (QUIT) any unicorn worker using 300 MB or more.
%x{pgrep -f 'unicorn worker'}.each_line do |pid|
  Process.kill('QUIT', pid.to_i) if (get_mem pid) >= 300
end
The unicorn master notices when a child has been killed and automagically respawns a new one. I am pretty sure unicorn workers honor a QUIT signal by shutting down after the current request completes.
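The script just runs from cron every few minutes; the crontab entry is something like the following (the path and the five-minute interval are only an example, not our exact setup):
# check unicorn worker memory every 5 minutes
*/5 * * * * /usr/local/bin/unicorn_mem_check.rb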
Upvotes: 0