Reputation: 1036
I'm trying to monitor unicorn workers with monit, so it gracefully kills them when they reach a certain memory threshold.
The problem:
When I tell monit to restart a worker, it first tries to stop it by firing my /etc/init.d/unicorn_orly kill_worker 0 script command.
# my /etc/monit/config.d/unicorn file
check process orly_unicorn_worker_0 with pidfile /tmp/unicorn.orly.0.pid
  start program = "/bin/true"
  stop program = "/etc/init.d/unicorn_orly kill_worker 0"
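(For context, the memory trigger itself is just a monit resource test in the same check block; something along these lines, where the 300 MB limit and cycle count are purely illustrative:)
  if totalmem > 300.0 MB for 2 cycles then restart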
Watching the processes with the top command, I can see the worker being killed and the master spawning a new worker with, of course, another PID.
Monit, however, waits for a while and then throws a "failed to stop" error in its log; it is actually waiting 30 seconds and timing out. Once it times out, monit considers the restart action done, then notices that the worker PID has changed, and continues to monitor the process as expected.
As a result everything works: monit is able to restart a worker when needed and keeps monitoring it. But the log is full of errors, the web interface shows a nasty (and confusing) "execution failed" status on the worker, and I guess it would send erroneous email alerts if those were set up.
This is the relevant part of the log when I try to restart a worker through the web interface (notice how it also gets confused about the worker's parent PID):
[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:29:17] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' failed to stop
[UTC Mar 5 13:29:47] info : 'orly_unicorn_worker_0' restart action done
[UTC Mar 5 13:29:47] error : 'orly_unicorn_worker_0' process PID changed to 13699
[UTC Mar 5 13:29:49] error : 'orly_unicorn_worker_0' process PPID changed to 0
[UTC Mar 5 13:30:19] info : 'orly_unicorn_worker_0' process PID has not changed since last cycle
[UTC Mar 5 13:30:19] error : 'orly_unicorn_worker_0' process PPID changed to 13660
[UTC Mar 5 13:30:49] info : 'orly_unicorn_worker_0' process PPID has not changed since last cycle
It took me a long time to figure out, but what's happening here is that the worker gets killed and then respawned so quickly that monit doesn't even notice the change.
My guess is that monit, when performing the stop action, reads /tmp/unicorn.orly.0.pid to get the PID of the process and then checks whether that process still exists.
However, since the kill-and-respawn operation happens so fast, monit doesn't realize that the worker's PID has changed and keeps waiting for the (brand new) worker to die. It then times out, finally realizes the PID has actually changed, and everything goes on as normal.
The dirty solution I have found:
To prove this hypothesis I tried to slow down that kill-and-respawn operation, so I edited the unicorn config file to make new workers sleep a few seconds just before they write their new PID to /tmp/unicorn.orly.0.pid.
after_fork do |server, worker|
  sleep 3
  # write down the new worker PID so monit can monitor it
  child_pid = server.config[:pid].sub(".pid", ".#{worker.nr}.pid")
  system("echo #{Process.pid} > #{child_pid}")
end
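(As an aside, the shell-out isn't essential for the pidfile write; the same hook could write the file directly in plain Ruby, along these lines:)
after_fork do |server, worker|
  sleep 3
  # write down the new worker PID so monit can monitor it
  child_pid = server.config[:pid].sub(".pid", ".#{worker.nr}.pid")
  File.open(child_pid, "w") { |f| f.puts Process.pid }
end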
And it works wonderfully: birds sing and flowers bloom in the sunny day, the web interface now shows a nice "process running" status, and the logs show everything going smoothly. Take a look:
[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' trying to restart
[UTC Mar 5 13:30:44] info : 'orly_unicorn_worker_0' stop: /etc/init.d/unicorn_orly
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' stopped
[UTC Mar 5 13:30:45] info : 'orly_unicorn_worker_0' start: /bin/true
[UTC Mar 5 13:30:46] info : 'orly_unicorn_worker_0' restart action done
The question:
Is there a monit-way of achieving this? Sleeping my workers for 3 seconds doesn't seem like a good solution. Any ideas?
I understand this is not the normal situation for monit. We have somewhat broken monit's restart cycle, since we don't want monit's start program to perform any action; instead we let the unicorn master process handle the respawn (as explained here: http://www.stopdropandrew.com/2010/06/01/where-unicorns-go-to-die-watching-unicorn-workers-with-monit.html).
Upvotes: 2
Views: 1075
Reputation: 1
In our environment, monit monitors the unicorn master, and the unicorn master monitors its children. We use a simple cron job to monitor the unicorn workers, killing them if they exceed a memory threshold:
#!/usr/bin/env ruby

# Return the resident memory (RSS) of the given PID in megabytes, read from /proc.
def get_mem(pid)
  pid = pid.to_i
  mem = 0
  if File.exist?("/proc/#{pid}/status")
    File.read("/proc/#{pid}/status").each_line do |status|
      next unless status =~ /^VmRSS:\s+(\d+) kb/i
      mem = $1.to_i / 1024
    end
  end
  mem
end

# Gracefully kill (QUIT) any unicorn worker using 300 MB or more.
%x{pgrep -f 'unicorn worker'}.each_line do |pid|
  Process.kill('QUIT', pid.to_i) if (get_mem pid) >= 300
end
The unicorn master notices when a child has been killed and automagically respawns a new one. I am pretty sure unicorn workers honor a QUIT signal by shutting down after the current request completes.
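The script just runs from cron every few minutes; the crontab entry is something like the following (the path and the five-minute interval are only an example, not our exact setup):
# check unicorn worker memory every 5 minutes
*/5 * * * * /usr/local/bin/unicorn_mem_check.rb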
Upvotes: 0