Reputation: 31
Python 3.8, Celery 4.4.4, Redis, Django
Looking for some guidance to help refine my restart logic and settings for Celery. I have a series of complex processes that make extensive use of chords to parallelize long-running programs. We run in AWS on Kubernetes, with pod memory requests set slightly above average usage and limits set slightly above observed maximums.
Given the complexity of the environment and the highly variable size of the inputs (documents), we occasionally get a WorkerLostError when a task is killed because the memory it needs is not available, even though usage is below the specified limit.
My question: Is a killed worker subject to the same retry logic as other exceptions? If I set reject_on_worker_lost=True, will retries be limited to the count I specify in the task decorator, or is this a potential infinite loop if the OS can never find the needed memory? Do backoff and jitter apply? Are there event handlers for this kind of OS-driven exception?
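For context, a simplified sketch of the kind of decorator settings I am asking about (the app, broker URL, and task name here are placeholders, not my actual code):

```python
from celery import Celery

app = Celery("docs", broker="redis://localhost:6379/0")

@app.task(
    bind=True,
    acks_late=True,                 # keep the message unacked until the task finishes
    reject_on_worker_lost=True,     # on WorkerLostError, reject/re-deliver instead of acking
    autoretry_for=(Exception,),     # normal exception-driven retries
    max_retries=3,
    retry_backoff=True,
    retry_jitter=True,
)
def process_document(self, document_id):
    # long-running, memory-hungry work on one document
    ...
```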
Upvotes: 1
Views: 42
Reputation: 31
So after some trial and error, it appears that retry counts and retry logic do NOT apply to worker lost errors. I plan on taking the approach of setting reject_on_worker_lost=True, but also making sure that I pass an expires argument to the chord callback in this case so that the risk of an infinite loop is managed.
It would be great if the Celery team updated the documentation to deal with some of these realities in the cloud. I would love to see clear guidance on how to retry worker-lost tasks while maintaining the retry functionality, and/or refined documentation of custom request/task on_failure() so that we can more easily do it ourselves. I will probably re-examine all of my auto-retry logic, since worker lost is the only exception I am really concerned with.
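A minimal sketch of what I mean, assuming hypothetical process_document and aggregate_results tasks as the chord header and callback (names and the one-hour expiry are placeholders):

```python
from celery import Celery, chord

app = Celery("docs", broker="redis://localhost:6379/0")

@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def process_document(self, document_id):
    ...  # long-running, memory-hungry work

@app.task
def aggregate_results(results):
    ...  # combine the per-document outputs

def run(doc_ids):
    # expires bounds how long a re-delivered message remains eligible to run,
    # so repeated worker losses cannot re-queue the work forever.
    header = [process_document.s(d).set(expires=3600) for d in doc_ids]
    return chord(header)(aggregate_results.s().set(expires=3600))
```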
Upvotes: 0