zje

Reputation: 3912

Sun Grid Engine suspend instead of restart jobs

This may be a cluster-specific issue that only an admin can address, but when I have a low-priority job running and a high-priority one comes along, the low-priority process is killed.

When the high-priority job finishes, the low-priority job is restarted. Is there a way on the user end to make it suspend on the machine it was originally started on, via SIGSTOP or something similar, without killing the process? Unfortunately, checkpointing is not an option here, so I would like to be able to hold the job without throwing away what's in memory.

We do have ssh access to this machine, so if all else fails, I'm tempted to write a really sloppy scripting hack to get the desired behavior:

1. start the process locally
2. send a SIGSTOP
3. make the job script send SIGCONT and just spin watching the process
4. when the job gets suspended, send a SIGSTOP again
5. when the job gets resumed, it should just send a SIGCONT

But I would much rather do everything within SGE to avoid any nasty surprises.
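For illustration, the workaround steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions that are not confirmed for this cluster: the helper names (suspend, resume, install_relay) are made up, and the relay assumes the job was submitted with qsub -notify so that the job script receives SIGUSR1 shortly before SGE sends SIGSTOP (SIGSTOP itself cannot be caught). It does nothing if the cluster kills the job outright.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Pause a process in place; its memory is kept (steps 2 and 4)."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a paused process continue where it stopped (steps 3 and 5)."""
    os.kill(pid, signal.SIGCONT)


def install_relay(worker_pid):
    """Forward suspend/resume events seen by the job script to the worker.

    Assumption: the job was submitted with `qsub -notify`, so SIGUSR1
    arrives before the uncatchable SIGSTOP. SIGCONT can be caught, so
    the handler runs once the job script itself is resumed.
    """
    signal.signal(signal.SIGUSR1, lambda signum, frame: suspend(worker_pid))
    signal.signal(signal.SIGCONT, lambda signum, frame: resume(worker_pid))


if __name__ == "__main__":
    # Stand-in worker: a child process that just sleeps for a few seconds.
    worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    suspend(worker.pid)
    # WUNTRACED makes waitpid report a stopped (not only an exited) child.
    _, status = os.waitpid(worker.pid, os.WUNTRACED)
    print("worker stopped:", os.WIFSTOPPED(status))
    resume(worker.pid)
    worker.terminate()
    worker.wait()
```

In a real job script, install_relay would be called with the PID of the process started in step 1, and the script would then just sleep in a loop while the worker runs.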

Upvotes: 0

Views: 1622

Answers (1)

Hristo Iliev

Reputation: 74475

The suspend/stop mechanism in SGE is controlled on a per-queue basis by the queue properties suspend_method, resume_method and terminate_method. The defaults are:

  • suspend_method - send SIGSTOP
  • resume_method - send SIGCONT
  • terminate_method - send SIGKILL

Unless someone has changed these default values, I can see no reason for SGE to kill the jobs instead of stopping them.
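To see what a given queue is actually configured to do, the queue configuration can be printed with qconf -sq and the three method lines picked out. The snippet below is a rough sketch of that: the function names are made up for illustration, and all.q in the usage note is only a placeholder queue name. As far as I know, a value of NONE means the built-in default signal is used, but check the queue_conf man page of your installation.

```python
import subprocess

METHOD_KEYS = ("suspend_method", "resume_method", "terminate_method")


def parse_queue_methods(qconf_output):
    """Pick the three method attributes out of `qconf -sq` text.

    Each configuration line is expected to look like "<attribute> <value>".
    """
    methods = {}
    for line in qconf_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in METHOD_KEYS:
            methods[parts[0]] = parts[1].strip()
    return methods


def queue_methods(queue_name):
    """Run `qconf -sq <queue_name>` and return its method settings."""
    result = subprocess.run(
        ["qconf", "-sq", queue_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_queue_methods(result.stdout)
```

Calling queue_methods("all.q") on a submit host would then return a dictionary with the three settings. Anything other than the defaults listed above (for example a custom script as suspend_method) would be a candidate explanation for jobs being killed instead of stopped.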

Upvotes: 1
