Reputation: 33
Hi. A program called G09 runs in parallel using Linda. It spawns its parallel child processes on other nodes (hosts) as
/usr/bin/ssh -x compute-0-127.local -n /usr/local/g09l/g09/linda-exe/l1002.exel ...other_opts...
However, when the master node kills this process, the corresponding child process on the other node (here compute-0-127) does not die but keeps running in the background. Right now, I manually log in to each node that has these orphaned Linda processes and kill them with kill. Is there any way to kill such child processes automatically?
See pastebin 1 for the pstree output before the process is killed and pastebin 2 for the pstree output after the parent is killed:
pastebin1 - http://pastebin.com/yNXFR28V
pastebin2 - http:// pastebin.com/ApwXrueh
(Not enough reputation points to hyperlink the second pastebin, sorry!)
Update to Answer1
Thanks, Martin, for explaining. I tried the following:
killme() { kill 0 ; } ; #Make calls to prepare for running G09 ;
g09 < "$g09inp" > "$g09out" &
trap killme 'TERM'
wait
but when Torque/Maui (which handles job execution) kills the job (this script) with qdel $jobid, the processes started by G09 as ssh -x $host -n still run in the background. What am I doing wrong here? (Normal termination is not a problem, as G09 itself stops those processes.) The following is the pstree output before qdel:
bash
|-461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
| `-g09
| `-l1002.exe 1048576000Pd-C-C-addn-H-MO6-fwd-opt.chk
| `-cLindaLauncher/tmp/viaExecDataN6
| |-l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | `-{l1002.exel}
| |-ssh -x compute-0-149.local -n ...
| |-ssh -x compute-0-147.local -n ...
| |-ssh -x compute-0-146.local -n ...
| |-{cLindaLauncher}
| `-{cLindaLauncher}
`-pbs_demux
and after qdel it still shows:
461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
`-ssh -x -n compute-0-149 rm\040-rf\040/state/partition1/trirag09/461
l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
`-{l1002.exel}
ssh -x compute-0-149.local -n /usr/local/g09l/g09/linda-exe/l1002.exel
ssh -x compute-0-147.local -n /usr/local/g09l/g09/linda-exe/l1002.exel
ssh -x compute-0-146.local -n /usr/local/g09l/g09/linda-exe/l1002.exel
What am I doing wrong here? Is the trap killme 'TERM' wrong?
Upvotes: 2
Views: 1431
Reputation: 1254
I had a similar problem using ssh -N (similar to ssh -n), and kill -9 0 does not work for me if I run it inside a script that initiates the ssh call. I find that kill $(jobs -p) does terminate the ssh process. It is not very elegant, but that is what I am using currently.
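The idea of killing the script's own background jobs can also be sketched with explicit PID bookkeeping instead of asking the shell for its job list. This is only a minimal sketch under assumptions: the names CHILD_PIDS, start_bg and kill_children are hypothetical and not part of G09, Linda or Torque, and it only signals processes on the local host (for example the local ssh clients).

```shell
#!/bin/sh
# Sketch: remember the PID of every background job, then signal them explicitly.
# CHILD_PIDS, start_bg and kill_children are hypothetical names for illustration.
CHILD_PIDS=""

# Run a command in the background and record its PID.
start_bg() {
    "$@" &
    CHILD_PIDS="$CHILD_PIDS $!"
}

# Send TERM (the default signal of kill) to every recorded PID, then reap them.
kill_children() {
    for pid in $CHILD_PIDS; do
        kill "$pid" 2>/dev/null || true   # ignore jobs that already exited
    done
    for pid in $CHILD_PIDS; do
        wait "$pid" 2>/dev/null || true   # reap the job so no zombie remains
    done
    CHILD_PIDS=""
}

# Possible usage inside a job script (commands are placeholders):
#   trap 'kill_children; exit 143' TERM
#   start_bg sleep 600
#   wait
```

Installing the trap before starting the background command avoids a window in which a TERM signal arrives while no handler is set.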
Upvotes: 0
Reputation: 127457
I would try the following approach:
Sending a KILL signal to the process group is really easy: kill -9 0. Try this:
#!/bin/sh
./b.sh 1 &
./b.sh 2 &
sleep 10
kill -9 0
where b.sh is
#!/bin/sh
while /bin/true
do
echo $1
sleep 1
done
You can have as many child processes as you want (directly or indirectly); they will all get the signal - as long as they don't detach themselves from the process group.
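To check which local processes a kill 0 would actually reach, one can compare process group IDs. The following is a minimal sketch, assuming Linux, where /proc/PID/stat lists the process group ID as the third field after the parenthesised command name. The helper names pgid_of and same_group are hypothetical.

```shell
#!/bin/sh
# Sketch (Linux only): test whether a PID is in the same process group as this shell.
# pgid_of and same_group are hypothetical helper names.

# Print the process group ID of the given PID, or fail if the process is gone.
pgid_of() {
    [ -r "/proc/$1/stat" ] || return 1
    read -r stat_line < "/proc/$1/stat" || return 1
    # Strip "pid (comm) " so the remaining fields are: state ppid pgrp ...
    set -- ${stat_line##*) }
    printf '%s\n' "$3"
}

# Succeed only if the given PID shares this shell's process group.
same_group() {
    mine=$(pgid_of "$$") || return 1
    theirs=$(pgid_of "$1") || return 1
    [ "$theirs" = "$mine" ]
}

# Possible usage:
#   ./b.sh 1 &
#   same_group "$!" && echo "kill 0 would reach PID $!"
```

Processes started on another host through ssh live in that host's process table, so a process-group signal sent on the local machine can only reach the local ssh client, never the remote program directly.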
Upvotes: 1