Perl system(), exec() and interactions with LSF

Question

I have a script that has to kick off two independent processes and wait until one of them finishes before continuing.

Up to now, I've created one process with fork (if the returned PID is 0, exec; otherwise wait) and the other with system and a command line.

Now I'm preparing to roll this script out to run 400 iterations of such paired processes on Platform Load Sharing Facility (LSF), but I'm concerned about stability. I know that the processes can crash. When that happens, I need a way to detect that a process has crashed, and then kill its partner process and the main script.

Originally I wrote a watchdog with a 3-minute watch period: if 3 minutes pass without activity, it kills the processes. However, this produced a lot of false positives, because when LSF suspends one of the two processes, the watchdog sees it as inactive.

In LSF, once I have submitted the jobs, I have the option to kill them. However, when I kill a job, what exactly do I kill? Will the kill take down the two processes the Perl script has created, or leave them running as orphans?

To reiterate,

Will killing a job on the LSF queue also kill every process that job has created?
What's the best (safest?) way to start two independent processes from a Perl script and wait until one of them exits before continuing?
How can I write a watchdog that can distinguish between a process that has crashed and a process that has been suspended by the LSF admin?

ikegami · Accepted Answer

The monitor should be the process that creates the child processes. (It can launch the "main script" as well.) wait will then tell you when a child exits, including when it crashes.

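my %children;

my $pid1 = fork();
if (!defined($pid1)) { ... }   # fork failed
if ($pid1 == 0) { ... }        # Child: exec the first program here
++$children{$pid1};            # Parent: remember the child

my $pid2 = fork();
if (!defined($pid2)) { ... }   # fork failed
if ($pid2 == 0) { ... }        # Child: exec the second program here
++$children{$pid2};            # Parent: remember the child

while (keys(%children)) {
   my $pid = wait();
   last if $pid == -1;         # No child processes left
   next if !$children{$pid};   # Not one of the children we track

   delete($children{$pid});

   if ($? & 0x7F) { ... }   # Killed by a signal
   if ($? >> 8) { ... }     # Exited with a non-zero status
}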
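
To make the "wait for whichever finishes first, then kill its partner" part concrete, here is a minimal sketch built on the same fork/wait pattern as the loop above. The helper names spawn and run_pair are made up for illustration, and using TERM to stop the survivor is just one reasonable choice, not something LSF requires. Because the monitor blocks in wait instead of timing inactivity, a child that is merely stopped does not show up as finished, which should avoid the false positives of a timeout-based watchdog.

```perl
use strict;
use warnings;
use POSIX qw(_exit);

# Hypothetical helper: fork, exec the given command in the child,
# and return the child's PID to the parent.
sub spawn {
    my @cmd = @_;
    my $pid = fork();
    die "fork failed: $!" if !defined($pid);
    if ($pid == 0) {
        exec { $cmd[0] } @cmd;
        warn "exec failed: $!\n";
        _exit(127);                # Leave the child without running parent cleanup
    }
    return $pid;
}

# Hypothetical helper: start two commands, block until either one
# terminates, then terminate and reap the other.
# Returns (which child finished, signal number, exit code).
sub run_pair {
    my ($cmd1, $cmd2) = @_;

    my %name_of;
    $name_of{ spawn(@$cmd1) } = 'first';
    $name_of{ spawn(@$cmd2) } = 'second';

    # Block until one of our two children terminates.
    my $pid;
    do { $pid = wait() } until $pid == -1 || exists $name_of{$pid};
    die "no children to wait for" if $pid == -1;

    my $status   = $?;             # Save it: the waitpid below overwrites $?
    my $finished = delete $name_of{$pid};

    # Stop the surviving partner and collect it so no zombie is left.
    for my $other (keys %name_of) {
        kill 'TERM', $other;
        kill 'CONT', $other;       # In case it is stopped, let it see the TERM
        waitpid($other, 0);
    }

    return ($finished, $status & 0x7F, $status >> 8);
}
```

It could be called as my ($who, $signal, $code) = run_pair(['./worker_a'], ['./worker_b']); where the two worker paths are placeholders for your real commands. A non-zero $signal means the first finisher was killed by that signal, and a non-zero $code means it exited with an error.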
