Reputation: 12672
I am invoking a Python tool called spark-ec2 from a Bash script.
As part of its work, spark-ec2 makes several calls to the system's ssh command via the subprocess module.
s = subprocess.Popen(
    ssh_command(opts) + ['-t', '-t', '-o', 'ConnectTimeout=3',
                         '%s@%s' % (opts.user, host), stringify_command('true')],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT  # we pipe stderr through stdout to preserve output order
)
cmd_output = s.communicate()[0]  # [1] is stderr, which we redirected to stdout
For some reason, spark-ec2 is hanging on that line where communicate() is called. I have no idea why.
For the record, here is an excerpt that shows how I'm invoking spark-ec2:
# excerpt from script-that-calls-spark-ec2.sh
# snipped: load AWS keys and do other setup stuff
timeout 30m spark-ec2 launch "$CLUSTER_NAME" ...
# snipped: if timeout, report and exit
What's killing me is that when I call spark-ec2 alone it works fine, and when I copy and paste commands from this Bash script and run them interactively they work fine.
It's only when I execute the whole script like this
$ ./script-that-calls-spark-ec2.sh
that spark-ec2 hangs on that communicate() step. This is driving me nuts.
What's going on?
Upvotes: 1
Views: 1970
Reputation: 12672
This is one of those things that, once I figured it out, made me say "Wow..." out loud in a mixture of amazement and disgust.
In this case, spark-ec2 isn't hanging because of some deadlock related to the use of subprocess.PIPE, as might've been the case if spark-ec2 had used Popen.wait() instead of Popen.communicate().
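For anyone who hasn't hit that particular pitfall, here's a minimal sketch of it (the long-output command is made up for illustration and has nothing to do with spark-ec2; it assumes a Unix-like system with seq available):

import subprocess

# A child process that produces more output than the OS pipe buffer can hold.
p = subprocess.Popen(
    ['seq', '1', '1000000'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)

# p.wait() here could deadlock: the child blocks once the pipe buffer fills up,
# while the parent blocks waiting for the child to exit.
# communicate() drains the pipe while it waits, so it avoids that deadlock.
output, _ = p.communicate()
print(len(output.splitlines()))  # 1000000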
The problem, as hinted at by the fact that spark-ec2 only hangs when the whole Bash script is invoked at once, is caused by something that behaves in subtly different ways depending on whether it's being called interactively or not.
In this case the culprit is the GNU coreutils utility timeout, and an option it offers called --foreground.
From the timeout man page:
--foreground
when not running timeout directly from a shell prompt,
allow COMMAND to read from the TTY and get TTY signals; in this
mode, children of COMMAND will not be timed out
Without this option, Python's communicate() cannot read the output of the SSH command being invoked by subprocess.Popen().
This probably has something to do with SSH allocating TTYs via the -t switches, but honestly I don't fully understand it.
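If you want to poke at this yourself, here's a small diagnostic sketch I'd try (save it as, say, probe.py; it's just a probe, not part of spark-ec2, and it assumes timeout moves the command into its own process group when --foreground isn't given):

import os
import sys

# Compare this process's group with the terminal's foreground process group.
# When they differ, a read from the TTY (which `ssh -t -t` wants to do) can
# stop or stall the process, which looks a lot like the hang above.
pgid = os.getpgrp()
try:
    fg = os.tcgetpgrp(sys.stdin.fileno())
    print('my process group: %d, terminal foreground group: %d' % (pgid, fg))
    print('in foreground:', pgid == fg)
except OSError:
    print('stdin is not attached to a terminal')

Running it as timeout 5 python probe.py from inside a script, and again with --foreground added, should show whether the child ends up in the terminal's foreground process group.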
What I can say, though, is that modifying the Bash script to use the --foreground option like this
timeout --foreground 30m spark-ec2 launch "$CLUSTER_NAME" ...
makes everything work as expected.
Now, if I were you, I would consider converting that Bash script into something else that won't drive you nuts...
Upvotes: 2