tau1777

Reputation: 131

Force Condor time out to have exit(0)

I've added the following line

periodic_remove = CurrentTime-EnteredCurrentStatus > 1200

to a Condor sub file, and the job is aborted after 20 minutes, as I want. However, this sub file is part of a DAG, and because the job gets aborted, the DAG will not move on to the subsequent jobs.
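For context, here is a minimal sketch of my setup; the file and node names (job.sub, job.sh, next.sub, A, B) are placeholders, not my actual files:

# job.sub -- submit file for the node that can time out
executable = job.sh
output     = job.out
error      = job.err
log        = job.log
# abort the job after it has spent more than 20 minutes in its current status
periodic_remove = CurrentTime - EnteredCurrentStatus > 1200
queue

# pipeline.dag -- B should only start after A finishes
JOB A job.sub
JOB B next.sub
PARENT A CHILD B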

Is there a way to make this timeout code act like a success in the eyes of the DAG scheduler, so that the scheduler starts the subsequent jobs?

1st Edit

I think I may have found a hint as to the answer:

You can use these expressions to automate many common actions. For example, suppose you know that your job will never run for more than an hour, and if it is running for more than an hour, something is probably wrong and will need investigating. Instead of leaving your job running on the cluster needlessly, Condor can place your job on hold with the following added to the submit file:

periodic_hold = (ServerStartTime - JobStartDate) > 3600

Or suppose you have a job that occasionally segfaults but you know if you run it again on the same data, chances are it will finish successfully. You can get this behavior by adding this line to the submit file:

on_exit_remove = (ExitBySignal == False) || (ExitSignal != 11)

The above expression will not let the job leave the queue if it exited by a signal and that signal number was 11 (representing segmentation fault). In any other case of the job exiting, it will leave the queue.

This information is part of an overall condor tutorial here: http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+III+Managing+Clusters/Chapter+15+Condor+A+Distributed+Job+Scheduler/15.2+Using+Condor/

Can anyone verify if this is the right track?

Upvotes: 1

Views: 280

Answers (1)

tau1777

Reputation: 131

Using on_exit_remove was not the key. I added

on_exit_remove = (ExitCode == 1)

to the *.sub files after adding

periodic_hold = (ServerStartTime - JobStartDate) > 3600

because I was trying to force a job that was removed to be seen as a success by the *.dag file, but this addition to the *.sub files caused my jobs to keep recycling in the queue, and none of them ever completed.

The solution was to attach a POST script to the job I had removed. I suppose the script could be anything that evaluates to success; I just used a bash file with a simple echo command inside.
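Roughly like this (the names pipeline.dag, A, B, and post.sh here are just illustrative):

# pipeline.dag -- attach a POST script to the node that gets removed
JOB A job.sub
JOB B next.sub
SCRIPT POST A post.sh
PARENT A CHILD B

where post.sh is just:

#!/bin/bash
# always exit 0, so DAGMan marks node A successful even though the job was removed
echo "post script: treating node as success"
exit 0

Since the POST script's return value is what determines the node's success or failure, node A is marked successful and DAGMan goes on to submit B.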

Basically, as it says here: http://research.cs.wisc.edu/htcondor/manual/v8.0/2_10DAGMan_Applications.html

under section 2.10.2, bullet point SCRIPT: as long as a POST script evaluates to true, the entire node will essentially have an exit code of 0.

Upvotes: 2
