Reputation: 805
What is the best way to wait for a Kubernetes job to complete? I noticed a lot of suggestions to use:
kubectl wait --for=condition=complete job/myjob
but I think that only works if the job is successful. If it fails, I have to do something like:
kubectl wait --for=condition=failed job/myjob
Is there a way to wait for both conditions using wait? If not, what is the best way to wait for a job to either succeed or fail?
Upvotes: 53
Views: 44322
Reputation: 37
Here is what I did: I label my jobs so that I can use the labels to find them, then wait for the labeled job(s) to have status.ready=0:
k wait -l label=value --for=jsonpath='{.status.ready}'=0 job
You can then use the following to find out whether the first job returned by the label selector failed:
failed=$(k get jobs -l label=value -o jsonpath={.items[0].status.failed})
exit ${failed}
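A minimal end-to-end sketch of this approach, assuming a hypothetical label app=myjob and spelling out k as kubectl (the jsonpath form of --for needs a reasonably recent kubectl):
#!/usr/bin/env bash
# Wait until the labeled job reports no ready pods (status.ready == 0)
kubectl wait -l app=myjob --for=jsonpath='{.status.ready}'=0 job --timeout=300s
# .status.failed is unset when no pods failed, so default to 0
failed=$(kubectl get jobs -l app=myjob -o jsonpath='{.items[0].status.failed}')
exit "${failed:-0}"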
Upvotes: 0
Reputation: 886
Run the first wait condition as a subprocess and capture its PID. If the condition is met, this process will exit with an exit code of 0.
kubectl wait --for=condition=complete job/myjob &
completion_pid=$!
Do the same for the failure wait condition. The trick here is to add && exit 1
so that the subprocess returns a non-zero exit code when the job fails.
kubectl wait --for=condition=failed job/myjob && exit 1 &
failure_pid=$!
Then use the Bash builtin wait -n $PID1 $PID2
to wait for whichever condition is met first. The command returns the exit code of the first process to exit:
MAC USERS! Note that
wait -n [...PID]
requires Bash version 4.3 or higher. macOS is forever stuck on version 3.2 due to licensing issues. See this Stack Overflow post on how to install a newer version (a short Homebrew sketch also follows the complete example below).
wait -n $completion_pid $failure_pid
Finally, you can check the actual exit code of wait -n
to see whether the job failed or not:
exit_code=$?
if (( $exit_code == 0 )); then
echo "Job completed"
else
echo "Job failed with exit code ${exit_code}, exiting..."
fi
exit $exit_code
Complete example:
# wait for completion as background process - capture PID
kubectl wait --for=condition=complete job/myjob &
completion_pid=$!
# wait for failure as background process - capture PID
kubectl wait --for=condition=failed job/myjob && exit 1 &
failure_pid=$!
# capture exit code of the first subprocess to exit
wait -n $completion_pid $failure_pid
# store exit code in variable
exit_code=$?
if (( $exit_code == 0 )); then
echo "Job completed"
else
echo "Job failed with exit code ${exit_code}, exiting..."
fi
exit $exit_code
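As for the macOS note above: assuming Homebrew is available, installing and using a newer Bash could look like this (the install path depends on your Homebrew prefix, and wait-for-job.sh is just a placeholder name for the script above):
bash --version                            # the system Bash on macOS reports 3.2.x
brew install bash                         # installs a current Bash alongside the system one
/opt/homebrew/bin/bash ./wait-for-job.sh  # or /usr/local/bin/bash on Intel Macs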
Upvotes: 49
Reputation: 8876
You can use the following workaround using kubectl logs --follow:
kubectl wait --for=condition=ready pod --selector=job-name=YOUR_JOB_NAME --timeout=-1s
kubectl logs --follow job/YOUR_JOB_NAME
It will terminate when your job terminates, with any status.
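Note that the exit code of kubectl logs --follow does not reflect the Job's result, so you may want a follow-up status check. A small sketch building on the same commands (the job name is a placeholder):
kubectl wait --for=condition=ready pod --selector=job-name=YOUR_JOB_NAME --timeout=-1s
kubectl logs --follow job/YOUR_JOB_NAME
# After the log stream ends, check whether the Job reported success
succeeded=$(kubectl get job/YOUR_JOB_NAME -o jsonpath='{.status.succeeded}')
if [[ "${succeeded:-0}" -ge 1 ]]; then
  echo "Job succeeded"
else
  echo "Job failed"
  exit 1
fi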
Upvotes: 4
Reputation: 7789
The wait -n approach does not work for me as I need it to work on both Linux and macOS.
I improved a little on the answer provided by Clayton, because his script would not work with set -e -E enabled. The following will work even in that case.
while true; do
if kubectl wait --for=condition=complete --timeout=0 job/name 2>/dev/null; then
job_result=0
break
fi
if kubectl wait --for=condition=failed --timeout=0 job/name 2>/dev/null; then
job_result=1
break
fi
sleep 3
done
if [[ $job_result -eq 1 ]]; then
echo "Job failed!"
exit 1
fi
echo "Job succeeded"
You might want to add a timeout to avoid an infinite loop, depending on your situation.
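For example, a bounded variant of the same loop, assuming a hypothetical 600-second budget:
deadline=$((SECONDS + 600))   # give up after ~10 minutes
job_result=""
while (( SECONDS < deadline )); do
  if kubectl wait --for=condition=complete --timeout=0 job/name 2>/dev/null; then
    job_result=0
    break
  fi
  if kubectl wait --for=condition=failed --timeout=0 job/name 2>/dev/null; then
    job_result=1
    break
  fi
  sleep 3
done
if [[ -z "$job_result" ]]; then
  echo "Timed out waiting for job/name"
  exit 2
fi
if [[ $job_result -eq 1 ]]; then
  echo "Job failed!"
  exit 1
fi
echo "Job succeeded"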
Upvotes: 9
Reputation: 401
You can leverage the behaviour of --timeout=0.
In this scenario, the command returns immediately with either exit code 0 or 1. Here's an example:
retval_complete=1
retval_failed=1
while [[ $retval_complete -ne 0 ]] && [[ $retval_failed -ne 0 ]]; do
sleep 5
output=$(kubectl wait --for=condition=failed job/job-name --timeout=0 2>&1)
retval_failed=$?
output=$(kubectl wait --for=condition=complete job/job-name --timeout=0 2>&1)
retval_complete=$?
done
if [ $retval_failed -eq 0 ]; then
echo "Job failed. Please check logs."
exit 1
fi
So when either condition=failed or condition=complete is true, execution exits the while loop (retval_complete or retval_failed will be 0).
Next, you only need to check and act on the condition you want. In my case, I want to fail fast and stop execution when the job fails.
Upvotes: 11
Reputation: 4683
kubectl wait --for=condition=<condition name> waits for a specific condition, so as far as I know it cannot wait on multiple conditions at the moment.
My workaround is to use oc get --wait; --wait closes the command when the target resource is updated. I monitor the status section of the job using oc get --wait until status is updated; an update of the status section means the Job has finished with some status conditions.
If the job completes successfully, status.conditions.type is updated to Complete right away. If the job fails, the job's pod is restarted automatically regardless of whether restartPolicy is OnFailure or Never, but we can treat the job as Failed if the status is not updated to Complete after the first update.
Here is my test evidence:
# vim job.yml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-wle", "exit 0"]
      restartPolicy: Never
status.conditions.type is updated as Complete if the job completes successfully.
# oc create -f job.yml && oc get job/pi -o=jsonpath='{.status}' -w && oc get job/pi -o=jsonpath='{.status.conditions[*].type}' | grep -i -E 'failed|complete' || echo "Failed"
job.batch/pi created
map[startTime:2019-03-09T12:30:16Z active:1]Complete
# vim job.yml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-wle", "exit 1"]
      restartPolicy: Never
The job is deemed Failed if the first status update is not Complete. Test it after deleting the existing job resource:
# oc delete job pi
job.batch "pi" deleted
# oc create -f job.yml && oc get job/pi -o=jsonpath='{.status}' -w && oc get job/pi -o=jsonpath='{.status.conditions[*].type}' | grep -i -E 'failed|complete' || echo "Failed"
job.batch/pi created
map[active:1 startTime:2019-03-09T12:31:05Z]Failed
I hope it helps you. :)
Upvotes: 4