Reputation: 6549
I'm currently working on porting an existing single-server Django web project to Amazon Elastic Beanstalk. So far, I've successfully set up the project to use RDS, Elasticsearch, Simple Email Service, and S3 without too much trouble. I'm using CodeDeploy to build a Docker container for the Django project and deploy it to an Elastic Beanstalk environment. All of this works beautifully, but I'm running into an issue trying to get an Elastic Beanstalk worker environment to work well with this setup.
I'm deploying the same Docker container to my worker environments, but with a different start command: `celery -A project worker -l INFO` instead of `gunicorn config.wsgi --bind 0.0.0.0:5000 --chdir=/app --workers 3`. This seems to work; the worker consumes messages and processes them just fine, but it frequently appears to stop working for minutes at a time, even when there's a backlog of messages waiting in the queue.
During my testing, I'm running my invoice generation routine, which queues up a message for each account's invoice using a Celery `group` inside a `chain`, so it processes the invoices and then emails me a "completed" notice (see the sketch below). In total, I have about 250 messages in the queue at the outset. Tailing the Celery logs for the Docker container, I can see batches of anywhere between 8-12 messages getting picked up and processed within a second or two, but then the worker goes idle for several minutes at a time, usually about 4 minutes.
I'm not seeing any errors anywhere I can think to look.
I've also experimented with scaling the worker environment up so that it's running multiple worker nodes, but this just spreads the issue across multiple nodes; i.e., instead of one worker picking up 8-12 messages, two workers pick up between 4-6 messages each, process them, then go idle.
At this point, I have no idea what I should be looking at anymore, and I'm contemplating doing away with the worker environment altogether. Maybe it'd make more sense to just run the Celery worker process in the same environment as the web server? I'd prefer not to do that, since I was thinking it'd be much easier to set up scaling rules for the web server and workers independently, but it's starting to look like I'll have no other choice.
Is there something I'm missing in this setup or some reason that the Celery worker environment is behaving in this way?
Upvotes: 2
Views: 471
Reputation: 860
In case anyone stumbles upon this.
https://github.com/celery/celery/issues/6352
This was the culprit for this whole fiasco. This is what we use in our system; we are on 4.4.0 (an impacted version). It was fixed in version 5.1.0.
I ran the same test on two versions (4.4.0 and 5.3.1). Where it's fixed, it's literally working 1000% faster. Literally 1000, it's not an exaggeration.
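To confirm which version is actually running inside the worker container, a quick check (sketch) is:

```python
# Per the issue above, the fix landed in Celery 5.1.0,
# so anything older may still be affected.
import celery
print(celery.__version__)
```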
Upvotes: 0
Reputation: 1875
Given that changing the number of Celery workers or nodes doesn't change the delay, I believe the issue is somewhere in how a given Celery worker is attempting to pull tasks off the SQS queue.
With a 4-minute timeout, it seems awfully close to the default retry delay set by Celery's `Task.default_retry_delay`, which is 3 minutes. It could also be related to `Task.rate_limit`, the config parameter that throttles the total number of tasks Celery workers will accept in a given unit of time.
As a first step, I would go into your Celery config and manually change these two values: make them higher and see how it affects the timeout or changes the application throughput.
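For example, both can be overridden per task in the decorator (a minimal sketch; the task name and values are illustrative, not from your project):

```python
from celery import Celery

app = Celery("project", broker="sqs://")  # assumption: SQS broker

@app.task(
    bind=True,
    default_retry_delay=30,   # retry after 30s instead of the 180s (3 min) default
    rate_limit="100/m",       # allow up to 100 of this task per minute per worker
)
def generate_invoice(self, account_id):
    try:
        ...  # placeholder: invoice generation logic
    except Exception as exc:
        raise self.retry(exc=exc)
```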
Upvotes: 2