gartenriese

Reputation: 4356

Checking via SSH whether a specific program is still running, in parallel

I have several machines where I have a program running. Every 30 seconds or so I want to check if those programs are still running. I use the following command to do that.

ssh ${USER}@${HOSTS[i]} "bash -c 'if [[ -z \"\$(pgrep -u ${USER} program)\" ]]; then exit 1; else exit 0; fi'"

Running this on >100 machines takes a long time, and I want to speed it up by checking in parallel. I am aware of '&' and 'parallel', but I am unsure how to retrieve the return value (task completed or not).

Upvotes: 1

Views: 144

Answers (1)

Charles Duffy

Reputation: 295373

The following lets all connections complete before starting any in the next batch, and thus can potentially wait for more than 30 seconds -- but should give you a good idea of how to do what you're looking for:

hosts=( host1 host2 host3 )
user=someuser
script="script you want to run on each remote host"

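# Pretend the last run was 30 seconds ago so the first pass starts immediately.
last_time=$(( SECONDS - 30 ))
# Proceed at once if 30 seconds have elapsed; otherwise sleep off the remainder.
while (( ( SECONDS - last_time ) >= 30 )) || \
      sleep $(( 30 - (SECONDS - last_time) )); do
  last_time=$SECONDS
  declare -A pids=( )   # maps the PID of each background ssh to its host
  for host in "${hosts[@]}"; do
    ssh "${user}@${host}" "$script" & pids[$!]="$host"
  done
  # wait returns each ssh's exit status: the remote script's status,
  # or 255 if ssh itself failed.
  for pid in "${!pids[@]}"; do
    wait "$pid" || {
      echo "Failure monitoring host ${pids[$pid]} at time $SECONDS" >&2
    }
  done
done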
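To make the "how do I get the return value back" part concrete, here is a minimal sketch of the same PID-to-host pattern that stores each exit status instead of only printing failures. The names check_hosts, host_status, and remote_check are made up for this illustration, the ssh options are just a reasonable guess at what you would want, and it assumes bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# Sketch only: check_hosts, host_status and remote_check are hypothetical names.

# Filled in by check_hosts: host name -> exit status of its check.
declare -A host_status=()

# check_hosts CHECK_CMD HOST...
# Runs "CHECK_CMD HOST" in the background for every host, then waits for
# each one and records its exit status in host_status.
check_hosts() {
  local check_cmd=$1; shift
  local -A pids=()
  local host pid rc
  for host in "$@"; do
    "$check_cmd" "$host" &
    pids[$!]=$host
  done
  for pid in "${!pids[@]}"; do
    rc=0
    wait "$pid" || rc=$?
    host_status["${pids[$pid]}"]=$rc
  done
}

# Assumed checker for the question's use case: pgrep exits 0 if a matching
# process exists and 1 if not; ssh passes that status through, or exits
# with 255 if the connection itself failed.
remote_check() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "${user}@$1" \
    "pgrep -u '${user}' program >/dev/null"
}

# Possible usage (not run here):
#   user=someuser
#   check_hosts remote_check host1 host2 host3
#   for h in "${!host_status[@]}"; do echo "$h -> ${host_status[$h]}"; done
```

Because the status is looked up by PID, it does not matter in which order the connections finish.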

Now, bigger picture: Don't do that.

Almost every operating system has a process supervision framework. Ubuntu has Upstart; Fedora and CentOS 7 have systemd; Mac OS X has launchd; runit, daemontools, and others can be installed anywhere (and are very, very easy to use -- look at the run scripts at http://smarden.org/runit/runscripts.html for examples).

Using these tools is the Right Way to monitor a process and ensure that it restarts whenever it exits. Unlike this (very high-overhead) solution, they have almost no overhead at all: they rely on the operating system notifying a process's parent when that process exits, rather than polling for the process (and that only after all the overhead of connecting via SSH, negotiating a pair of session keys, starting a shell to run your script, etc.).
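To illustrate the mechanism only (this is not a substitute for any of those tools), here is a toy sketch of a parent that blocks in wait until its child exits and then starts it again, with no polling in between. The function name supervise and the restart cap are invented for the demonstration; a real supervisor restarts indefinitely and also handles logging, signals, and restart throttling.

```shell
#!/usr/bin/env bash
# Toy sketch: "supervise" is a hypothetical name, not a real tool.

# supervise MAX_STARTS CMD [ARGS...]
# Starts CMD as a child, blocks in wait until the kernel reports that it
# exited, then starts it again, up to MAX_STARTS times in total.
# Returns the exit status of the last run.
supervise() {
  local max_starts=$1; shift
  local starts=0 rc=0
  while (( starts < max_starts )); do
    starts=$(( starts + 1 ))
    "$@" &
    rc=0
    wait "$!" || rc=$?   # sleeps until the child exits; no polling
    echo "start ${starts}: '$1' exited with status ${rc}" >&2
  done
  return "$rc"
}

# Possible usage (not run here): start ./program up to 5 times.
#   supervise 5 ./program
```

The parent uses no CPU while the child is alive, which is where the "almost no overhead" comes from.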

Yes, this may be a small private project. Still, you're making extra complexity (and thus, extra bugs) for yourself -- and if you learn to use the tools to do this right, you'll know how to do things right when you have something that isn't a small private project.

Upvotes: 2
