Reputation: 101
I have a single-threaded Unix process that communicates over TCP with other processes.
The problem is the following: when I start the process, it hangs (without busy-looping) until I kill it.
The funny thing is, as soon as I attach with strace to it, it continues to run with the expected behavior as if there wasn't any problem at all (always reproducible).
What could be the reason for this behavior? What effect does strace have on the state of a process?
The reason strace changed the behavior was that we were using a version of OpenOnload that had a bug. As soon as we attached strace, the network stack was moved back into the kernel and the problem was gone.
Upvotes: 6
Views: 4214
Reputation: 4154
Many years later, and so probably with a completely different root cause, this blog post explains why attaching a tracer might fix hung system calls: https://ayende.com/blog/198849-C/production-postmortem-the-heisenbug-server?Key=1eeda567-02a8-4bbb-b90f-557523973233. It looks like running strace (or any other tool that uses the ptrace system call) can cause "hung" system calls to return (with the error EINTR).
Quoting the ptrace man page:
Some system calls return with EINTR if a signal was sent to a tracee, but delivery was suppressed by the tracer. (This is a very typical operation: it is usually done by debuggers on every attach, in order to not introduce a bogus SIGSTOP). As of Linux 3.2.9, the following system calls are affected (this list is likely incomplete): epoll_wait(2), and read(2) from an inotify(7) file descriptor. The usual symptom of this bug is that when you attach to a quiescent process with the command
strace -p <process-ID>
then, instead of the usual and expected one-line output such as
restart_syscall(<... resuming interrupted call ...>_
or
select(6, [5], NULL, [5], NULL_
('_' denotes the cursor position), you observe more than one line. For example:
clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
epoll_wait(4,_
What is not visible here is that the process was blocked in epoll_wait(2) before strace(1) has attached to it. Attaching caused epoll_wait(2) to return to user space with the error EINTR. In this particular case, the program reacted to EINTR by checking the current time, and then executing epoll_wait(2) again. (Programs which do not expect such "stray" EINTR errors may behave in an unintended way upon an strace(1) attach.)
Upvotes: 4
Reputation: 1096
I had this problem only once and it was related to signal handling. It is one source of race conditions in single-threaded code.
Upvotes: 0
Reputation: 4398
Most likely the strace output simply slows down the process, making deadlocks much less likely. I have seen this happen before with strace, and it can also happen when adding other debug printing or debug calls.
Deadlocks are most often seen with multi-threaded interaction, but in your case you have multiple processes. If strace frees up the processes every time, then I guess the way you open the sockets or handshake on them is what is hanging. Buffering and blocking on the socket could be getting you into a deadlocked state between processes.
Similar question, but with a multi-threaded process and a deadlock between threads instead of between separate processes: Using strace fixes hung memory issue
It is hard to give general examples, especially as I don't know what your different processes are doing or whether they share resources in some way. I will try . . .
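To make the handshake case concrete, here is a minimal Python sketch of two peers that both wait to receive before sending. It is an illustration under assumptions, not the asker's code: the names `handshake` and `run_pair` are made up, two threads on a `socket.socketpair()` stand in for the separate TCP processes, and a timeout is added only so the demo ends instead of hanging.

```python
import socket
import threading


def handshake(sock, send_first, timeout=0.5):
    """One side of a toy one-message handshake (hypothetical protocol)."""
    # Without a timeout, recv() would block forever when nobody sends.
    sock.settimeout(timeout)
    try:
        if send_first:
            sock.sendall(b"hello")
        data = sock.recv(16)
        if not send_first:
            sock.sendall(b"hello")
        return "got " + data.decode()
    except socket.timeout:
        return "timed out"


def run_pair(a_sends_first, b_sends_first):
    """Run both sides concurrently and report what each one saw."""
    a, b = socket.socketpair()
    results = {}

    def side(name, sock, send_first):
        results[name] = handshake(sock, send_first)

    threads = [
        threading.Thread(target=side, args=("a", a, a_sends_first)),
        threading.Thread(target=side, args=("b", b, b_sends_first)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    a.close()
    b.close()
    return results


if __name__ == "__main__":
    print(run_pair(True, False))   # one side speaks first: both complete
    print(run_pair(False, False))  # both wait to receive: both time out
```

When neither side sends first, each one sits in `recv()` waiting for the other, which is the kind of silent hang (no busy loop) described in the question. Whether that is what happened here cannot be told from the question alone.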
Example with one object/resource which should be protected:
One process starts making changes on an object (e.g. adding items to a list/db table)
Another process starts iterating the list/table.
The danger is that the iterating process's loop gets confused and never exits, OR does something worse like writing to invalid memory.
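A rough single-threaded stand-in for this, in Python: the append inside the loop plays the role of the other process adding items while the list is being iterated. The function name `iterate` and the `max_steps` cap are assumptions added so that the demo terminates; a real unprotected loop would have no such cap.

```python
def iterate(items, grow, max_steps=1000):
    """Walk a list, optionally growing it during the walk."""
    steps = 0
    for item in items:
        if grow:
            # Stands in for another process adding entries mid-iteration.
            items.append(item)
        steps += 1
        if steps >= max_steps:
            # The loop would otherwise never reach the end of the list.
            return f"gave up after {steps} steps"
    return f"finished after {steps} steps"


if __name__ == "__main__":
    print(iterate(["a", "b", "c"], grow=False))
    print(iterate(["a", "b", "c"], grow=True))
```

With real processes and shared memory the outcome is less predictable than in this toy version, which is why the shared object needs protecting in the first place.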
Example where object/resource is protected by mutexes:
The classic simple two-resource deadlock problem (simpler than dining philosophers).
One thread/process grabs mutex on object A, does some work.
Another thread/process grabs mutex on object B, does some work.
That second thread/process then needs to update object A, so it waits for the mutex on A.
The original thread/process needs to access object B, so it waits for the mutex on B.
. . . . . . . . . . . . @ . . . . . . . . . . .
Silence except for the noise of the wind and a tumbleweed blowing across the landscape.
Deadlocked.
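The two-mutex scenario above can be sketched in Python roughly as follows. This is only an illustration: `both_finish`, `lock_a` and `lock_b` are made-up names, threads stand in for the threads/processes in the description, and the second acquire uses a timeout so that the demo reports the deadlock instead of hanging in it.

```python
import threading
import time


def both_finish(order_1, order_2, timeout=0.5):
    """Return True if both workers managed to take both of their locks."""
    results = []

    def worker(first, second):
        with first:
            # Give the other worker time to take its own first lock.
            time.sleep(0.1)
            got_second = second.acquire(timeout=timeout)
            if got_second:
                second.release()
            results.append(got_second)

    threads = [
        threading.Thread(target=worker, args=order)
        for order in (order_1, order_2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


lock_a = threading.Lock()
lock_b = threading.Lock()

if __name__ == "__main__":
    # Opposite order (A then B versus B then A): the deadlock pattern.
    print(both_finish((lock_a, lock_b), (lock_b, lock_a)))
    # Same order in both workers: no deadlock.
    print(both_finish((lock_a, lock_b), (lock_a, lock_b)))
```

Taking the locks in the same order everywhere is the usual way out. The `time.sleep(0.1)` also hints at why timing changes, such as the slowdown from strace, can make this kind of hang appear or disappear.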
Upvotes: 0