Reputation: 31
Python 3.8, Celery 4.4.4, Redis, Django
Looking for some guidance to help refine my restart logic and settings for Celery. I have a series of complex processes that make extensive use of chords to parallelize long-running programs. We run in AWS on Kubernetes, with pod memory requests set slightly above average usage and limits set slightly above observed maximums.
Given the complexity of the environment and the highly variable size of the inputs (documents), we occasionally get a WorkerLostError when a task is killed because the memory it needs is not available, even though usage is below the specified limit.
My question: Is a killed worker subject to the same retry logic as other exceptions? If I set reject_on_worker_lost=True, will retries be limited to the count I specify in the task decorator, or is this a potential infinite loop if the OS can never find the needed memory? Do backoff and jitter apply? Are there event handlers for this kind of OS-driven exception?
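For context, a simplified sketch of the kind of decorator settings I am asking about (the app, broker URL, and task name here are placeholders, not my actual code):

```python
from celery import Celery

app = Celery("docs", broker="redis://localhost:6379/0")

@app.task(
    bind=True,
    acks_late=True,                 # keep the message unacked until the task finishes
    reject_on_worker_lost=True,     # on WorkerLostError, reject/re-deliver instead of acking
    autoretry_for=(Exception,),     # normal exception-driven retries
    max_retries=3,
    retry_backoff=True,
    retry_jitter=True,
)
def process_document(self, document_id):
    # long-running, memory-hungry work on one document
    ...
```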
Upvotes: 1
Views: 42
Reputation: 31
So after some trial and error, it appears that retry counts and retry logic do NOT apply to worker lost errors. I plan on taking the approach of setting reject_on_worker_lost=True, but also making sure that I pass an expires argument to the chord callback in this case so that the risk of an infinite loop is managed.
It would be great if the Celery team updated the documentation to deal with some of these realities in the cloud. I would love to see clear guidance on how to retry worker-lost tasks while maintaining the retry functionality, and/or refined documentation of custom request/task on_failure() so that we can more easily do it ourselves. I will probably re-examine all of my auto-retry logic, since worker lost is the only exception I am really concerned with.
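A minimal sketch of what I mean, assuming hypothetical process_document and aggregate_results tasks as the chord header and callback (names and the one-hour expiry are placeholders):

```python
from celery import Celery, chord

app = Celery("docs", broker="redis://localhost:6379/0")

@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def process_document(self, document_id):
    ...  # long-running, memory-hungry work

@app.task
def aggregate_results(results):
    ...  # combine the per-document outputs

def run(doc_ids):
    # expires bounds how long a re-delivered message remains eligible to run,
    # so repeated worker losses cannot re-queue the work forever.
    header = [process_document.s(d).set(expires=3600) for d in doc_ids]
    return chord(header)(aggregate_results.s().set(expires=3600))
```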
Upvotes: 0