Prince

Reputation: 33

kill all processes spawned by parent with `ssh -x -n` on other hosts

Hi. A program named G09 runs in parallel using Linda. It spawns its parallel child processes on other nodes (hosts) like this:

/usr/bin/ssh -x compute-0-127.local -n /usr/local/g09l/g09/linda-exe/l1002.exel ...other_opts...

However, when the master node kills this process, the corresponding child process on the other node (here compute-0-127) does not die but keeps running in the background. Right now, I manually log in to each node that has these orphaned Linda processes and kill them with kill. Is there any way to kill such child processes automatically?

See pastebin 1 for the pstree output before killing the process and pastebin 2 for the pstree output after the parent is killed:
pastebin1 - http://pastebin.com/yNXFR28V
pastebin2 - http:// pastebin.com/ApwXrueh
(Not enough reputation points to hyperlink the second pastebin, sorry.)
Update in response to Answer 1
Thanks, Martin, for explaining. I tried the following:

killme() { kill 0 ; }
# Make calls to prepare for running G09
g09 < "$g09inp" > "$g09out" &
trap killme 'TERM'
wait

but when Torque/Maui (which handles job execution) kills the job (this script) with qdel $jobid, the processes started by G09 as ssh -x $host -n still run in the background. What am I doing wrong here? (Normal termination is not a problem, as G09 itself stops those processes.) The following is the pstree output before qdel:

bash
|-461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
|   `-g09
|       `-l1002.exe 1048576000Pd-C-C-addn-H-MO6-fwd-opt.chk
|           `-cLindaLauncher/tmp/viaExecDataN6
|               |-l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   `-{l1002.exel}
|               |-ssh -x compute-0-149.local -n ...
|               |-ssh -x compute-0-147.local -n ...
|               |-ssh -x compute-0-146.local -n ...
|               |-{cLindaLauncher}
|               `-{cLindaLauncher}
`-pbs_demux

and after qdel it still shows

461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
`-ssh -x -n compute-0-149 rm\040-rf\040/state/partition1/trirag09/461

l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
`-{l1002.exel}

ssh -x compute-0-149.local -n /usr/local/g09l/g09/linda-exe/l1002.exel

ssh -x compute-0-147.local -n /usr/local/g09l/g09/linda-exe/l1002.exel

ssh -x compute-0-146.local -n /usr/local/g09l/g09/linda-exe/l1002.exel

What am I doing wrong here? Is the trap killme 'TERM' line wrong?

Upvotes: 2

Views: 1431

Answers (2)

Karol

Reputation: 1254

I had a similar problem using ssh -N (similar to ssh -n), and kill -9 0 does not work for me when I run it inside the script that initiates the ssh call. I find that kill $(jobs -p) does terminate the ssh process. It is not very elegant, but it is what I am using currently.
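
A minimal sketch of that workaround, assuming bash. Here sleep 30 is only a hypothetical stand-in for the long-running ssh call, and the variable names are made up for illustration:

```shell
#!/bin/bash
# Sketch: stop only the background jobs that this script started.
# `sleep 30` is a hypothetical stand-in for the long-running ssh command.

sleep 30 &
first_pid=$!
sleep 30 &
second_pid=$!

# jobs -p prints the PIDs of this shell's background jobs, one per line.
job_pids=$(jobs -p)
if [ -n "$job_pids" ]; then
    kill $job_pids             # unquoted on purpose: one argument per PID
fi
wait 2>/dev/null               # reap the jobs and hide the "Terminated" notices

remaining=0
for pid in "$first_pid" "$second_pid"; do
    # kill -0 sends no signal; it only tests whether the PID still exists.
    if kill -0 "$pid" 2>/dev/null; then
        remaining=$((remaining + 1))
    fi
done
echo "jobs still running: $remaining"
```

Unlike kill 0, this only reaches the direct background jobs of the shell, so the script itself keeps running and can do further cleanup.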

Upvotes: 0

Martin v. Löwis

Reputation: 127457

I would try the following approach:

  • create a script/application that wraps the g09 binary you are starting, and start that wrapper instead
  • in the script, wait for the HUP signal to arrive (it should be received when the ssh connection is closed)
  • when handling the HUP signal, send a signal to your own process group (i.e. kill with PID 0), which kills all processes in the group.

Sending a KILL signal to the process group is really easy: kill -9 0. Try this:

#!/bin/sh
./b.sh 1 &
./b.sh 2 &
sleep 10
kill -9 0

where b.sh is

#!/bin/sh
while /bin/true
do
  echo $1
  sleep 1
done

You can have as many child processes as you want (directly or indirectly); they will all get the signal - as long as they don't detach themselves from the process group.
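
To check whether a given child is still in the script's process group, the group IDs can be compared with ps. This is a small sketch assuming bash and a POSIX ps; pgid_of is a hypothetical helper and sleep 30 a stand-in for a real worker:

```shell
#!/bin/bash
# Sketch: check that a background child shares the script's process group.
# `sleep 30` is a hypothetical stand-in for a real worker.

pgid_of() {
    # ps -o pgid= prints only the process group ID, without a header line.
    ps -o pgid= -p "$1" | tr -d ' '
}

sleep 30 &
child_pid=$!

own_pgid=$(pgid_of $$)
child_pgid=$(pgid_of "$child_pid")
echo "script pgid: $own_pgid, child pgid: $child_pgid"

if [ "$own_pgid" = "$child_pgid" ]; then
    echo "child is in the same group, so kill 0 would reach it"
else
    echo "child is in a different group, so kill 0 would miss it"
fi

# Clean up the stand-in worker.
kill "$child_pid"
wait "$child_pid" 2>/dev/null || true
```

Note that for an ssh child only the local ssh client is a member of the local group. The program it starts on the remote host belongs to a process group on that host, so a local kill 0 never signals it directly.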

Upvotes: 1
