Nick Chammas

Reputation: 12672

subprocess.communicate() mysteriously hangs only when run from a script

I am invoking a Python tool called spark-ec2 from a Bash script.

As part of its work, spark-ec2 makes several calls to the system's ssh command via the subprocess module.

Here's an example (excerpted from spark-ec2; ssh_command() and stringify_command() are helpers defined elsewhere in the tool):

s = subprocess.Popen(
    ssh_command(opts) + ['-t', '-t', '-o', 'ConnectTimeout=3',
                         '%s@%s' % (opts.user, host), stringify_command('true')],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT  # we pipe stderr through stdout to preserve output order
)
cmd_output = s.communicate()[0]  # [1] is stderr, which we redirected to stdout

For some reason, spark-ec2 is hanging on that line where communicate() is called. I have no idea why.

For the record, here is an excerpt that shows how I'm invoking spark-ec2:

# excerpt from script-that-calls-spark-ec2.sh

# snipped: load AWS keys and do other setup stuff

timeout 30m spark-ec2 launch "$CLUSTER_NAME" ...

# snipped: if timeout, report and exit

What's killing me is that spark-ec2 works fine when I call it on its own, and the commands also work fine when I copy and paste them out of this Bash script and run them interactively.

It's only when I execute the whole script like this

$ ./script-that-calls-spark-ec2.sh

that spark-ec2 hangs on that communicate() step. This is driving me nuts.

What's going on?

Upvotes: 1

Views: 1970

Answers (1)

Nick Chammas

Reputation: 12672

This is one of those things that, once I figured it out, made me say "Wow..." out loud in a mixture of amazement and disgust.

In this case, spark-ec2 isn't hanging because of some deadlock related to the use of subprocess.PIPE, as might've been the case if spark-ec2 had used Popen.wait() instead of Popen.communicate().
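For reference, here's a minimal sketch of the kind of wait()-related deadlock I mean. This is hypothetical code, not anything from spark-ec2, and it hangs by design:

# Hypothetical sketch of the classic PIPE deadlock the subprocess docs
# warn about (NOT what spark-ec2 does). The child writes more output than
# the OS pipe buffer can hold and blocks on write; the parent blocks in
# wait() without ever reading, so neither side makes progress.
import subprocess

p = subprocess.Popen(
    ['python3', '-c', 'print("x" * 1000000)'],  # child emits ~1 MB to stdout
    stdout=subprocess.PIPE,
)
p.wait()                   # hangs: the pipe buffer fills, the child blocks on write
output = p.stdout.read()   # never reached

# communicate() sidesteps this by draining the pipe while waiting for exit,
# which is exactly what spark-ec2 does -- so this wasn't the problem here.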

The problem, as hinted at by the fact that spark-ec2 only hangs when the whole Bash script is invoked at once, is caused by something that behaves in subtly different ways depending on whether it's being called interactively or not.

In this case the culprit is the GNU coreutils utility timeout, and an option it offers called --foreground.

From the timeout man page:

   --foreground
          when not running timeout directly from a shell prompt,
          allow COMMAND to read from the TTY and get TTY signals; in this
          mode, children of COMMAND will not be timed out

Without this option, Python's communicate() cannot read the output of the SSH command being invoked by subprocess.Popen().

This probably has something to do with SSH allocating TTYs via the -t switches, but honestly I don't fully understand it.
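To illustrate the behavior, here's a stripped-down, hypothetical stand-in for what spark-ec2 does. The user@example.com target is a placeholder, and this isn't the actual spark-ec2 source:

# repro.py -- a hypothetical, minimal stand-in for spark-ec2's ssh calls.
# Substitute a real host you can reach for user@example.com.
import subprocess

s = subprocess.Popen(
    ['ssh', '-t', '-t', '-o', 'ConnectTimeout=3', 'user@example.com', 'true'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # pipe stderr through stdout, as spark-ec2 does
)
print(s.communicate()[0])

# Run it from a script (i.e., not an interactive prompt) and compare:
#   timeout 30s python repro.py               # hangs on communicate()
#   timeout --foreground 30s python repro.py  # completes as expected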

What I can say, though, is that modifying the Bash script to use the --foreground option like this

timeout --foreground 30m spark-ec2 launch "$CLUSTER_NAME" ...

makes everything work as expected.

Now, if I were you, I would consider converting that Bash script into something else that won't drive you nuts...

Upvotes: 2
