Reputation: 636
I have some Celery tasks that run on VMs to perform web crawling.
Python 3.6, Celery 4.2.1, with a self-managed Redis server as the broker. The same Redis server is also used for caching and locks.
1. job_executor: This Celery worker runs on a VM and listens to the queue crawl_job_{job_id}. It executes the web crawling tasks. Only a single job_executor worker with concurrency = 1 runs on each VM. Each Crawl Job can have 1-20,000 URLs, and anywhere between 1 and 100 VMs running in a GCP Managed Instance Group; the number of VMs to run is defined in a per-job configuration. Each task can take from 15 seconds to 120 minutes.
2. crawl_job_initiator: This Celery worker runs on a separate VM and listens to the queue crawl_job_initiator_queue. A single task creates the required MIG and VMs via Terraform for one Crawl Job ID and then adds the job_executor tasks to the crawl_job_{job_id} queue (roughly as sketched below).
The task takes about 70 seconds to complete.
The concurrency for this worker was set to 1, so only one Crawl Job could be started at a time.
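For context, this is roughly the shape of the initiator task. It is a minimal sketch, not my actual code: the app object name, the create_mig_with_terraform helper, and the crawl_url task are placeholders.

from crawler.taskapp import app  # assumes the Celery app is exposed as `app`

@app.task(name="crawl_job_initiator")
def crawl_job_initiator(job_id, urls):
    # Provision the Managed Instance Group via Terraform (placeholder helper).
    create_mig_with_terraform(job_id)
    # Fan the URLs out to the per-job queue consumed by the job_executor workers.
    for url in urls:
        crawl_url.apply_async(args=[job_id, url],
                              queue="crawl_job_{}".format(job_id))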
To reduce the time it was taking to start a large number of Crawl Jobs, I decided to increase the concurrency of crawl_job_initiator to 20 without changing any other configuration. I also added a lock at the job_id level so that other tasks do not interfere with the crawl_job_initiator task. The lock is acquired at the start of the task and released once the task finishes. It is a non-blocking lock: if it cannot be acquired, the task is retried with exponential backoff.
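The locking pattern looks roughly like this. It is a minimal sketch assuming the redis-py client; the task name, lock key, timeout, and backoff numbers are illustrative, not my exact code:

import random
import redis
from crawler.taskapp import app  # assumes the Celery app is exposed as `app`

redis_client = redis.StrictRedis(host="10.16.1.3", port=6379, db=0)

@app.task(bind=True, max_retries=10)
def initiate_crawl_job(self, job_id):
    lock = redis_client.lock("crawl_job_lock_{}".format(job_id), timeout=600)
    # Non-blocking acquire: if another task holds the lock, retry later
    # with exponential backoff plus jitter instead of waiting on Redis.
    if not lock.acquire(blocking=False):
        raise self.retry(countdown=(2 ** self.request.retries) + random.random())
    try:
        do_the_work(job_id)  # placeholder for the actual task body
    finally:
        lock.release()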
Other tasks include a periodic task that deletes the VMs once the Crawl Job is finished.
After increasing the concurrency I started getting the following two errors:
In the crawl_job_initiator and other task logs:
consumer: Cannot connect to redis://:**@10.16.1.3:6379/0: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error..
Checking the Redis server logs, I found this:
# Can't save in background: fork: Cannot allocate memory
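For anyone hitting the same thing: the MISCONF error means a background RDB save (BGSAVE) failed, and here the underlying failure is that the fork() for the save child could not get memory. A quick way to confirm from a client, assuming redis-py:

import redis

r = redis.StrictRedis(host="10.16.1.3", port=6379, db=0)
persistence = r.info("persistence")
# "err" means the last background save failed and writes are being rejected.
print(persistence["rdb_last_bgsave_status"])
print(persistence["rdb_changes_since_last_save"])
print(r.info("memory")["used_memory_human"])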
Increasing the Redis server's memory configuration solved that issue for now. I think it is also solvable by setting vm.overcommit_memory = 1 (which I have not done yet, since everything has been fine so far).
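For reference, this is the standard Redis guidance for the fork failure: allow the kernel to overcommit memory so the BGSAVE child can be forked even when Redis holds more than half of the RAM.

# Add to /etc/sysctl.conf, then apply with "sysctl -p" (or reboot):
vm.overcommit_memory = 1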
The second error appeared in the Redis server logs:
Client id=43572 addr=10.128.1.218:57232 fd=7876 name= age=393 idle=385 flags=P db=0 sub=0 psub=1 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=2958 omem=48606000 events=rw cmd=psubscribe scheduled to be closed ASAP for overcoming of output buffer limits
The IP address belongs to one of the MIG VMs running the job_executor tasks. The same thing was also happening for the clients running the crawl_job_initiator task.
I read more about this and found that increasing the client-output-buffer-limit for pubsub clients should fix it.
Original setting: client-output-buffer-limit pubsub 16mb 8mb 60
Updated setting: client-output-buffer-limit pubsub 64mb 32mb 120
Even with this setting I kept getting the same error, so I raised the limit drastically to make the issue go away for the time being:
client-output-buffer-limit pubsub 4000mb 2000mb 60
Since then I have been trying to figure out why this error occurs at all. I tried adding 100,000 tasks to a single job_executor queue to see whether that would fill the buffer, but it did not.
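One way to watch the buffers directly while reproducing is to poll CLIENT LIST and look at the omem (output buffer memory) and oll (output list length) fields of the pubsub clients. A sketch, again assuming redis-py, with an arbitrary threshold:

import redis

r = redis.StrictRedis(host="10.16.1.3", port=6379, db=0)
for client in r.client_list():
    # omem = output buffer memory in bytes, oll = queued replies.
    if int(client["omem"]) > 1024 * 1024:  # arbitrary 1 MB threshold
        print(client["addr"], client["cmd"], client["omem"], client["oll"])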
What could be the reason behind this error? How should I go about debugging it? Is there a straightforward fix?
# job_executor supervisord config for celery worker
[program:crawl-job-executor]
command=/home/ubuntu/Env/bin/celery worker -A crawler.taskapp --loglevel=info --concurrency=1 --max-tasks-per-child=1 --max-memory-per-child=350000 -Ofair -Q crawl_job_{job_id} -n crawl_job_{job_id}
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=10
# crawl_job_initiator supervisord config for celery worker
[program:crawl-job-initiator]
command=/home/ubuntu/Env/bin/celery worker -A crawler.taskapp --loglevel=info --concurrency=20 --max-tasks-per-child=1 --max-memory-per-child=350000 -Ofair -Q crawl_job_initiator -n crawl_job_initiator@%%h
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=10
Upvotes: 3
Views: 2353