Reputation: 19070
I was reading the Kubernetes documentation about jobs and retries. I found this:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.
I had two questions about the above quote:
Upvotes: 9
Views: 15006
Reputation: 76
Looking at the source code, the backoffLimit attribute appears to specify a failure count rather than a failure duration.
Excerpt of the code mentioned above:
func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rErr error) {
	// ...
	// Count the succeeded and failed Pods belonging to this Job.
	succeeded, failed := getStatus(&job, pods, uncounted, expectedRmFinalizers)
	// ...
	// A new failure occurred if the failed count grew since the last sync.
	jobHasNewFailure := failed > job.Status.Failed
	// The Job exceeds its back-off limit once the number of failed Pods
	// passes .spec.backoffLimit.
	exceedsBackoffLimit := jobHasNewFailure && (active != *job.Spec.Parallelism) &&
		(failed > *job.Spec.BackoffLimit)
	// ...
}
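In other words, the controller compares the running count of failed Pods against the limit. A minimal sketch of that comparison (the function name and simplified signature are illustrative, not the actual controller code, which also checks for new failures and active parallelism):

```go
package main

import "fmt"

// exceedsBackoffLimit reports whether a Job with the given number of
// failed Pods has gone past its .spec.backoffLimit. Illustrative only.
func exceedsBackoffLimit(failed, backoffLimit int32) bool {
	return failed > backoffLimit
}

func main() {
	// With the default backoffLimit of 6, the Job is only marked
	// failed once a 7th Pod failure is recorded.
	fmt.Println(exceedsBackoffLimit(6, 6)) // still within the limit
	fmt.Println(exceedsBackoffLimit(7, 6)) // limit exceeded
}
```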
Upvotes: 0
Reputation: 5872
There is no confusion about .spec.backoffLimit: it is the number of retries.
The Job controller recreates the failed Pods (associated with the Job) with an exponential back-off delay (10s, 20s, 40s, ..., capped at 360s). This delay is set by the Job controller itself.
Upvotes: 9