Reputation: 395
In my project, GKE runs many jobs daily. Sometimes a job runs twice: the first time partially and the second time fully, although "restartPolicy: Never" is defined. It happens very rarely (about once per 200-300 runs).
This is an example:
I 2020-12-03T00:12:45Z Started container mot-test-deleteoldvalidations-container
I 2020-12-03T00:12:45Z Created container mot-test-deleteoldvalidations-container
I 2020-12-03T00:12:45Z Successfully pulled image "gcr.io/xxxxx/mot-del-old-validations:v16"
I 2020-12-03T00:12:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16"
I 2020-12-03T00:12:39Z Stopping container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Started container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Created container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Successfully pulled image "gcr.io/xxxx/mot-del-old-validations:v16"
I 2020-12-03T00:01:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16"
From the job's YAML:
spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
  resources:
    limits:
      cpu: "1"
      memory: 2500Mi
    requests:
      cpu: 500m
      memory: 2Gi
  dnsPolicy: ClusterFirst
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30
  volumes:
The reason for stopping the container is "Killing". How can I avoid this behavior?
Upvotes: 2
Views: 260
Reputation: 14084
As you mention in the comment section, your restartPolicy is set to Never. You have also set spec.backoffLimit, spec.completions and spec.parallelism, which should work. However, the documentation - Handling Pod and container failures - mentions that this behavior is possible and is not considered a problem.
Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.
In addition, per the CronJob documentation, the best practice is to make jobs idempotent.
A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.
In computing, an idempotent operation is one that has no additional effect if it is called more than once with the same input parameters. For example, removing an item from a set can be considered an idempotent operation on the set.
As your whole job manifest is still a mystery, two workarounds come to mind. Depending on the scenario, one of them might help.
First workaround
Use PodAntiAffinity, which won't allow a second pod with the same label/selector to be scheduled alongside the first (see the sketch below).
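A minimal sketch of that idea, assuming the Job's pod template carries a hypothetical label app: mot-del-old-validations (the built-in job-name label added by the Job controller could be used instead):

spec:
  template:
    metadata:
      labels:
        app: mot-del-old-validations          # hypothetical label, referenced by the selector below
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mot-del-old-validations
            # two pods carrying this label cannot be scheduled onto the same node
            topologyKey: kubernetes.io/hostname
      restartPolicy: Never
      containers:
      - name: mot-test-deleteoldvalidations-container
        image: gcr.io/xxxxx/mot-del-old-validations:v16

Note that with the kubernetes.io/hostname topology key this only keeps a duplicate pod off the node where the first one runs; on a multi-node cluster the duplicate could still land elsewhere, so treat it as a mitigation rather than a guarantee.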
Second workaround
Use an initContainer as a lock: the first pod acquires the lock, and if the second pod detects that the lock is already held, it waits 3-5 seconds and exits.
Because init containers run to completion before any app containers start, init containers offer a mechanism to block or delay app container startup until a set of preconditions are met.
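A minimal sketch of the lock idea, assuming a hypothetical PersistentVolumeClaim named job-lock-pvc that all pods of the job can mount (names and image are placeholders):

spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
      - name: lock
        persistentVolumeClaim:
          claimName: job-lock-pvc              # hypothetical shared volume holding the lock file
      initContainers:
      - name: acquire-lock
        image: busybox:1.32
        command: ["sh", "-c"]
        args:
        - |
          # if another pod already holds the lock, wait a few seconds and fail,
          # so the app container of this duplicate pod never starts
          if [ -f /lock/running ]; then
            sleep 5
            exit 1
          fi
          touch /lock/running
        volumeMounts:
        - name: lock
          mountPath: /lock
      containers:
      - name: mot-test-deleteoldvalidations-container
        image: gcr.io/xxxxx/mot-del-old-validations:v16
        volumeMounts:
        - name: lock
          mountPath: /lock

The main container (or a wrapper script) would still need to remove /lock/running when it finishes, otherwise the next run would see a stale lock; the check-then-touch sequence is also not atomic, so this narrows the window rather than closing it completely.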
Upvotes: 1