Reputation: 10249
I have the following set-up:
CELERYD_OPTS="--time-limit=600 -c:low_p 100 -c:high_p 50 -Q:low_p low_priority_queue_name -Q:high_p high_priority_queue_name"
My problem is that sometimes the queue seems to "back up"... that is, it will stop consuming tasks. There seem to be two scenarios for this:
- celery inspect active will show that not all workers are used up - that is, I will only see a few active tasks
- strace on the worker processes returns nothing... completely zero activity from the worker

I would appreciate any information or pointers on:
- How to debug this issue. I have tried using strace to see what the worker processes are doing, but so far that has only been useful in telling me that the worker is hanging.
- What monitoring/alarming solutions exist for this (I know about flower and events, and they are both excellent in real time, but they don't have any automated monitoring/alarming functionality). Am I just better off writing my own monitoring tools with supervisord?

Also, I am starting my tasks from django-celery.
Upvotes: 13
Views: 8050
Reputation: 5757
@goro, if you are making requests to external services, you should try the gevent or eventlet pool implementation instead of spawning 100500 workers. I also had a problem where celery workers stopped consuming tasks; it was caused by a bug in the celery+gevent+sentry (raven) combination.
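For example, a minimal sketch of running one worker on the eventlet pool (the project name and concurrency value are just placeholders; the queue name and time limit are taken from the question):

celery worker -A yourproject -P eventlet -c 500 -Q low_priority_queue_name --time-limit=600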
One thing I figured out about Celery is that it can work fine without any monitoring if everything is done right (currently I'm doing >50M tasks per day), but if it's not, monitoring will not help you very much. "Disaster recovery" in Celery is a bit tricky; not all things will work as you expect :(
You should break your solution into smaller pieces, maybe separating some tasks between different queues. At some point you'll find the code snippet which causes the problems.
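For example, a minimal sketch of routing different kinds of tasks to the two queues from the question (the task names are hypothetical; the setting name is the Celery 3.x / django-celery style):

CELERY_ROUTES = {
    # hypothetical slow task that talks to an external service
    'myapp.tasks.call_external_service': {'queue': 'low_priority_queue_name'},
    # hypothetical quick task that should never wait behind the slow ones
    'myapp.tasks.send_notification': {'queue': 'high_priority_queue_name'},
}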
Upvotes: 3
Reputation: 12310
A very basic queue watchdog can be implemented with just a single script that’s run every minute by cron. First, it fires off a task that, when executed (in a worker), touches a predefined file, for example:
with open('/var/run/celery-heartbeat', 'w'):
    pass
Then the script checks the modification timestamp on that file and, if it’s more than a minute (or 2 minutes, or whatever) away, sends an alarm and/or restarts the workers and/or the broker.
It gets a bit trickier if you have multiple machines, but the same idea applies.
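A minimal sketch of that idea (the task name, heartbeat path, threshold, and the supervisorctl restart command are assumptions for illustration, not part of the setup above):

import os
import subprocess
import time

from myproject.celery import app  # hypothetical Celery app module

HEARTBEAT_FILE = '/var/run/celery-heartbeat'
MAX_AGE = 120  # seconds of silence before we assume the workers are stuck

@app.task
def heartbeat():
    # Runs inside a worker; its only job is to prove tasks are being consumed.
    with open(HEARTBEAT_FILE, 'w'):
        pass

def check():
    # Called from cron every minute: fire a new heartbeat, then look at how
    # long ago the previous one actually ran.
    heartbeat.delay()
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        age = float('inf')  # file never written yet
    if age > MAX_AGE:
        # Alarm and/or restart the workers; supervisord is just one option.
        subprocess.call(['supervisorctl', 'restart', 'celery'])

if __name__ == '__main__':
    check()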
Upvotes: 4
Reputation: 31
I would think this is because of workers prefetching tasks. If this is still a problem, you can update Celery to 3.1 and use the -Ofair worker option. The config option that I tried using before -Ofair was CELERYD_PREFETCH_MULTIPLIER. However, setting CELERYD_PREFETCH_MULTIPLIER = 1 (its lowest value) does not help, since workers will still prefetch one task in advance.
See http://docs.celeryproject.org/en/latest/whatsnew-3.1.html#prefork-pool-improvements and especially http://docs.celeryproject.org/en/latest/whatsnew-3.1.html#caveats.
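For example, applied to the CELERYD_OPTS from the question (only -Ofair is new here; everything else is copied from that setup, and this assumes the workers are running Celery 3.1+):

CELERYD_OPTS="--time-limit=600 -Ofair -c:low_p 100 -c:high_p 50 -Q:low_p low_priority_queue_name -Q:high_p high_priority_queue_name"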
Upvotes: 3