user1418642

Reputation: 41

Condor Timeout for idle jobs

I'm running jobs on a Condor cluster, but some get stuck in the idle state and never start, let alone finish. Short of manually running condor_wait -wait n logfile and then condor_rm, is there a more graceful (automatic, built-in) way to terminate a hung job?

Also, since these jobs are part of a DAGMan workflow, is there a way to time out a job in the DAG so that the later jobs can run?

Upvotes: 4

Views: 1952

Answers (1)

user2313013

Reputation: 41

Here are two ways to cause a job to be automatically removed after being idle for too long (24 hours in this example).

  1. Put the following in the submit file for the job:

    periodic_remove = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24

  2. Or put the following in the condor configuration file on the submit machine:

    SYSTEM_PERIODIC_REMOVE = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
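For context, option 1 might sit in a complete submit file roughly as follows. This is only a sketch: the executable and the log/output file names are hypothetical placeholders, and the parentheses are added purely for readability. JobStatus == 1 is the Idle state, and EnteredCurrentStatus is the time at which the job entered its current state.

```
# Hypothetical submit file: my_job.sh and the my_job.* file names are placeholders.
universe   = vanilla
executable = my_job.sh
log        = my_job.log
output     = my_job.out
error      = my_job.err

# Remove the job if it has been Idle (JobStatus == 1) for more than 24 hours.
# EnteredCurrentStatus holds the time the job entered its current state.
periodic_remove = (JobStatus == 1) && ((CurrentTime - EnteredCurrentStatus) > 3600*24)

queue
```

Newer HTCondor releases appear to document time() as the replacement for CurrentTime, so it may be worth checking the manual for your version before copying the expression verbatim.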

Of course, it would be better to understand why the jobs are remaining in the idle state. To do that, you may find condor_q -analyze jobid helpful.

Upvotes: 4
