Reputation: 182829
I have a process that's monitored by its parent. The child encountered an error that caused it to call abort. The process does not tamper with the abort process, so it should proceed as expected (dump core, terminate). The parent is supposed to detect the child's termination and trigger a series of events to respond to the failure. The child is multi-threaded and complex.
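To make the expected behaviour concrete, here is a minimal sketch of the monitoring pattern described above. It is not the real code: the function name is hypothetical, and the child here does nothing except call abort. It assumes a POSIX system and that the default SIGABRT disposition is in effect.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that immediately calls abort(),
 * then block in waitpid() until it can be reaped.
 * Returns the signal that terminated the child, or -1 if fork/waitpid
 * failed or the child did not die from a signal. */
int spawn_aborting_child_and_reap(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: abort() raises SIGABRT; the default action terminates
         * the process and may dump core, depending on system limits. */
        abort();
    }

    /* Parent: wait for this specific child, retrying if interrupted. */
    int status;
    pid_t reaped;
    do {
        reaped = waitpid(pid, &status, 0);
    } while (reaped < 0 && errno == EINTR);

    if (reaped != pid)
        return -1;

    /* A child killed by a signal reports it through WIFSIGNALED/WTERMSIG. */
    if (WIFSIGNALED(status))
        return WTERMSIG(status);

    return -1;
}
```

In the normal case the call returns SIGABRT almost immediately, and that return is the point at which the parent would start its failure handling.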
Here's what I see from ps:
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0  1000  4929  1272  20   0  85440  6792 wait   S+   pts/2      0:00 rxd
1  1000  4930  4929  20   0      0     0 exit   Zl+  pts/2     38:21 [rxd] <defunct>
So the child (4930) has terminated. It is a zombie. I cannot attach to it, as expected. However, the parent (4929) stays blocked in:
int i;
// ...
waitpid (-1, &i, 0);
So it seems like the child is a zombie but somehow has not completed everything necessary for its parent to reap it. The WCHAN field of exit is, I think, a valuable clue.
The platform is 64-bit Linux, Ubuntu 13.04, kernel 3.8.0-30. The child doesn't appear to be dumping core or doing anything. I've left the system for several minutes and nothing changed.
Does anyone have any ideas what might be causing this or what I can do about it?
Update: Another interesting bit of information -- if I kill -9 the parent process, the child goes away. This is kind of baffling, since the parent process is trivial, just blocking in waitpid. Also, I don't get any core dump (from the child) when this problem happens.
Update: It seems the child is stuck in schedule, called from exit_mm, called from do_exit. I wonder why exit_mm would call schedule. And I wonder why killing the parent would unstick it.
Upvotes: 3
Views: 3925
Reputation: 182829
I finally figured it out! The process was actually doing useful work all this time. The process held the last reference to a large file on a slow filesystem. When the process terminates, the last reference to the file is released, forcing the OS to reclaim the space. The file was so large that this required tens of thousands of I/O operations, taking 10 minutes or more.
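For anyone who wants to see the mechanism in isolation, here is a small sketch of a process holding the last reference to a file. It assumes the file had already been unlinked while the process still had it open, which is one common way to end up in this situation. The function name and the /tmp path are made up for the example, and the file here is tiny, so the reclaim is instantaneous.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a temp file, unlink it while keeping the
 * descriptor open, and report the link count seen through that descriptor.
 * Returns the link count, or -1 on error. */
long link_count_after_unlink(void)
{
    char path[] = "/tmp/lastref-XXXXXX";   /* illustrative location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* Remove the only name. The data stays allocated because this
     * process still holds an open descriptor. */
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }

    /* The file is still fully usable through the descriptor. */
    if (write(fd, "data", 4) != 4) {
        close(fd);
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    long links = (long)st.st_nlink;

    /* Dropping the last reference: the filesystem frees the blocks here.
     * If the process exits instead, the kernel closes the descriptor as
     * part of exit, so the same work happens during process teardown. */
    close(fd);
    return links;
}
```

With a link count of zero, nothing but the open descriptor keeps the data alive. Scale the file up to many gigabytes on a slow filesystem and that final release is where the time goes.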
Upvotes: 8