user56reinstatemonica8
user56reinstatemonica8

Reputation: 34074

Shell script hangs, but only when called into a variable or strace

General question: What could cause a script that works fine in itself, to hang if the script or shell (bash) command that calls it calls it into a variable?

In other words, how could there be a script that works when called like this... /path/to/script arg arg ...fail and hang when called like this... VAR=$(/path/to/script arg arg);?


(major edits after noticing that a software glitch caused much initial testing to give incorrect results)


My specific case: I've got a script that works fine (to start, stop or restart the java application Apache Solr, adapted from here). Code is below, its command is sbin/service solr [action], for example sbin/service solr start.

When called from a script or directly from a console (bash in my case), like sbin/service solr start, it works fine and completes quickly. If, however, it's called into a variable, like VAR=$(sbin/service solr start);, it works, but hangs with a futext / clock_gettime loop (trace below). It also hangs if it's called not into a variable, but into strace.

Curiously, other scripts called in the same way with the same syntax, e.g. sbin/service httpd start, work just fine when called into a variable. So it's clearly possible for there to be something about a script that makes it hang when the output is stored as a variable, but operate perfectly well when this isn't the case.


Here's the results of testing what calls hang and what don't:

HANGS------------------------------------------------

(so it doesn't matter which process it's called from) Also, editting the script file to start the service using source causes the service to not work.

DOESN'T HANG-------------------------------------

Outputting to /dev/null allows us to request the output without causing it to hang. However, it's not much use, as no actual output is received.

Contrary to what I first thought. This outputs a simple update message which, ideally, we'd capture in a variable and log - but attempting to do so causes it to hang.

The syntax that hangs works just fine on other service scripts, and the output of the script is passed to the variable without issue.


Here's the full code of that script: (comments removed, naturally the $SOLR_DIR path is a real one in the real script)

SOLR_DIR="[path/to/application]"
JAVA_OPTIONS="-Xms64m -Xmx64m -DSTOP.PORT=8079 -DSTOP.KEY=mustard -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
    start)
        echo "Starting Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
        ;;
    stop)
        echo "Stopping Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS --stop
        ;;
    restart)
        $0 stop
        sleep 1
        $0 start
        ;;
    *)
        echo "Usage: $0 {start|stop|restart}" >&2
        exit 1
        ;;
esac

No errors or anything unusual in var/log/solr.log (log file named in script). Centos Linux server if that's relevant.


In response to an earlier version of the question, @cdarke suggested I run strace -f -o strace.out /path/to/script on a script calling this script, and look in the (massive!) output file strace.out. It's nearly 3mbs, here's some observations:

  1. Starts with lots of activity that looks like the script functioning as intended.

  2. Then, the last 15% or so of the log file is this, repeated with different integers but seemingly the same hex codes:

...

25687 futex(0x688d454, FUTEX_WAIT_PRIVATE, 1, {0, 49980000}) = -1 ETIMEDOUT (Connection timed out)
25687 futex(0x688d428, FUTEX_WAKE_PRIVATE, 1) = 0
25687 clock_gettime(CLOCK_MONOTONIC, {39074112, 932735888}) = 0
25687 clock_gettime(CLOCK_REALTIME, {1355007234, 333458000}) = 0

These PIDs come up with nothing when put through ps -p even if I do so while the script is still running, while the output file is still getting bigger and while these lines of code are still being written. I'm not quite sure how that's possible.

Here's the final bit out the output before it gets in the never-ending futex/clock_gettime loop, after the last part that is clearly the script executing correctly (solr/solr.xml is a Solr config file, which needs to be read to start the Solr process):

25874 stat("solr/solr.xml", {st_mode=S_IFREG|0777, st_size=1320, ...}) = 0
25874 write(2, "Dec 8, 2012 5:12:05 PM org.apach"..., 106) = 106
25874 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 89
25874 fcntl(89, F_GETFL)                = 0x2 (flags O_RDWR)
25874 fcntl(89, F_SETFL, O_RDWR|O_NONBLOCK) = 0
25874 setsockopt(89, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
25874 bind(89, {sa_family=AF_INET, sin_port=htons(8983), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
25874 listen(89, 50)                    = 0
25874 setsockopt(89, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
25874 lseek(12, 57747, SEEK_SET)        = 57747
25874 read(12, "PK\3\4\n\0\0\0\10\0\221Vi>F\347\254\364\325\4\0\0002\t\0\0002\0\0\0", 30) = 30
25874 lseek(12, 57827, SEEK_SET)        = 57827
25874 read(12, "\225V\377oSU\24\377\334\273\256\257_\36l\216m\254\262\351\224\241]\273\255\200\314/\5\246c\200"..., 1237) = 1237
25874 futex(0x2aaab0173054, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2aaab0173050, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
25874 <... futex resumed> )             = 1
25874 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1) = 0
25894 clock_gettime(CLOCK_REALTIME, {1355008325, 376033000}) = 0
25894 futex(0x2aaab0173054, FUTEX_WAIT_PRIVATE, 3, {0, 983000} <unfinished ...>
25874 <... futex resumed> )             = 1
25874 futex(0x2aaab0173054, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2aaab0173050, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
25874 <... futex resumed> )             = 1
25874 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1) = 0
25894 poll([{fd=89, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
25874 <... futex resumed> )             = 1
25874 write(2, "2012-12-08 17:12:05.376:INFO::St"..., 66) = 66
25874 write(2, "\n", 1)                 = 1
25874 mmap(0x41348000, 12288, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x41348000
25874 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
25874 sched_getaffinity(25874, 32,  { ffff, 0, 0, 0 }) = 32
25874 sched_getaffinity(25874, 32,  { ffff, 0, 0, 0 }) = 32
25874 gettid()                          = 25874
25874 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
25874 rt_sigprocmask(SIG_UNBLOCK, [HUP ILL BUS FPE SEGV USR2 TERM], NULL, 8) = 0
25874 rt_sigprocmask(SIG_BLOCK, [QUIT], NULL, 8) = 0
25874 mmap(0x41348000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x41348000
25874 mprotect(0x41348000, 12288, PROT_NONE) = 0
25874 futex(0x10632d54, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
25882 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0
25882 clock_gettime(CLOCK_MONOTONIC, {39075204, 21489888}) = 0
25882 clock_gettime(CLOCK_REALTIME, {1355008325, 422198000}) = 0
25882 futex(0x106cc454, FUTEX_WAIT_PRIVATE, 1, {0, 49984000}) = -1 ETIMEDOUT (Connection timed out)
25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0
25882 clock_gettime(CLOCK_MONOTONIC, {39075204, 72479888}) = 0
25882 clock_gettime(CLOCK_REALTIME, {1355008325, 473185000}) = 0
25882 futex(0x106cc454, FUTEX_WAIT_PRIVATE, 1, {0, 49987000}) = -1 ETIMEDOUT (Connection timed out)
25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0

So the last line before the death spiral is a read() on channel 12. Then it just loops futex and clock_gettime until it's manually killed.


This last point may be irrelevant, but if, similar to in this question, I run the script that calls this script using nohup and divert output to /dev/null, I get the following near the start ( about 100kb into the output file): a huge number of these:

25664 close(67) = -1 EBADF (Bad file descriptor)

They go from 67, increasing +1 each time, to:

25664 close(1023) = -1 EBADF (Bad file descriptor)

Then they're followed by

25664 open("/dev/null", O_RDWR) = 3

Again, the PIDs are empty, so far as I can see. Not sure if this is relevant - I guess this opens the possibility that using nohup with output to /dev/null really is a general fix for problems like this but I'm doing it wrong somehow, causing these errors.

Upvotes: 5

Views: 4478

Answers (1)

Gordon Davisson
Gordon Davisson

Reputation: 125788

I'm pretty sure the problem is that the shell is capturing output from the /sbin/service script and the solr service it starts, and hence will wait for the service to exit (or at least close its stdout) before continuing. Here's a simple demo:

$ bg_service() { while true; do sleep 10; done; }
$ start_bg_service() { echo "starting"; bg_service& echo "running"; }
$ start_bg_service 
starting
[1] 8656
running
$ var=$(start_bg_service)
[It hangs at this point... until I open another shell and kill the background process]

Upvotes: 2

Related Questions