David Schwartz
David Schwartz

Reputation: 182829

Process stuck in exit, shows as zombie but cannot be reaped

I have a process that's monitored by its parent. The child encountered an error that caused it to call abort. The process does not tamper with the abort process, so it should proceed as expected (dump core, terminate). The parent is supposed to detect the child's termination and trigger a series of events to respond to the failure. The child is multi-threaded and complex.

Here's what I see from ps:

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0  1000  4929  1272  20   0  85440  6792 wait   S+   pts/2      0:00 rxd
1  1000  4930  4929  20   0      0     0 exit   Zl+  pts/2     38:21 [rxd] <defunct>

So the child (4930) has terminated. It is a zombie. I cannot attach to it, as expected. However, the parent (4929) stays blocked in:

int i;
// ...
waitpid (-1, &i, 0);

So it seems like the child is a zombie but somehow has not completed everything necessary for its parent to reap it. The WCHAN field of exit is, I think, a valuable clue.

The platform is 64-bit Linux, Ubuntu 13.04, kernel 3.8.0-30. The child doesn't appear to be dumping core or doing anything. I've left the system for several minutes and nothing changed.

Does anyone have any ideas what might be causing this or what I can do about it?

Update: Another interesting bit of information -- if I kill -9 the parent process, the child goes away. This is kind of baffling, since the parent process is trivial, just blocking in waitpid. Also, I don't get any core dump (from the child) when this problem happens.

Update: It seems the child is stuck in schedule, called from exit_mm, called from do_exit. I wonder why exit_mm would call schedule. And I wonder why killing the parent would unstick it.

Upvotes: 3

Views: 3925

Answers (1)

David Schwartz
David Schwartz

Reputation: 182829

I finally figured it out! The process was actually doing useful work all this time. The process held the last reference to a large file on a slow filesystem. When the process terminates, the last reference to the file is release, forcing the OS to reclaim the space. The file was so large that this required tens of thousands of I/O operations, taking 10 minutes or more.

Upvotes: 8

Related Questions