John U

Reputation: 2993

Diagnosing process stuck in D state (uninterruptible sleep / blocked IO)

We are working on an embedded Linux system using Live555 WIS-Streamer to stream video over RTSP over a network.

On one particular system we see WIS-Streamer get stuck in the TASK_UNINTERRUPTIBLE state. From the command line, ps shows the status of the process as DW, and the children of the wis-streamer process are all listed as zombies.

It looks like there's nothing we can do once we're in this state, other than reboot (not desirable). However, we'd really like to get to the root cause of this - I suspect that within the streamer it's hanging on a blocking send call or somesuch. Is there anything we can do, either in the code or via the command line etc. to try and narrow down what's blocked?

As an example, I've tried looking at the output of netstat (netstat -alp) to see if there are dangling sockets attached to the PID of the blocked / zombie thread but to no avail.

Update with more info:

It's not thrashing the CPU; top lists the blocked & zombie threads as 0% mem / 0% CPU / VSZ 0.

Further things I've tried poking about the system:

/proc/<pid>/status for the main & child threads. 546 is the parent, which is blocked:

$> cat /proc/546/status
Name:   wis-streamer
State:  D (disk sleep)
Tgid:   546
Pid:    546
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups: 
Threads:        1
SigQ:   17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000004102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   1
Cpus_allowed_list:      0
voluntary_ctxt_switches:        997329
nonvoluntary_ctxt_switches:     2428751

Children:

Name:   wis-streamer
State:  Z (zombie)
Tgid:   581
Pid:    581
PPid:   546
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups: 
Threads:        1
SigQ:   17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   1
Cpus_allowed_list:      0
voluntary_ctxt_switches:        856676
nonvoluntary_ctxt_switches:     15626

Name:   wis-streamer
State:  Z (zombie)
Tgid:   582
Pid:    582
PPid:   546
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups: 
Threads:        1
SigQ:   17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   1
Cpus_allowed_list:      0
voluntary_ctxt_switches:        856441
nonvoluntary_ctxt_switches:     15694


Name:   wis-streamer
State:  Z (zombie)
Tgid:   583
Pid:    583
PPid:   546
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups: 
Threads:        1
SigQ:   17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   1
Cpus_allowed_list:      0
voluntary_ctxt_switches:        856422
nonvoluntary_ctxt_switches:     15837


Name:   wis-streamer
State:  Z (zombie)
Tgid:   584
Pid:    584
PPid:   546
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups: 
Threads:        1
SigQ:   17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   1
Cpus_allowed_list:      0
voluntary_ctxt_switches:        856339
nonvoluntary_ctxt_switches:     15500

Other things from the /proc filesystem:

$> cat /proc/546/personality
00c00000
$> cat /proc/546/stat
546 (wis-streamer) D 1 453 453 0 -1 4194564 391 0 135 0 140098 232409 0 0 20 0 1 0 1094 0 0 4294967295 0 0 0 0 0 0 0 4100 27138 3223605768 0 0 17 0 0 0 0 0 0
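
For what it's worth, field 35 of that stat line is wchan, which proc(5) describes as the kernel address where the process is sleeping. Here is a minimal sketch for pulling it out, assuming a POSIX-style shell (busybox ash should do). The function name stat_wchan is made up for illustration:

```shell
# Print the state (field 3) and wchan (field 35) of a /proc/<pid>/stat line.
stat_wchan() {
    # Drop everything up to the last ") " so that a command name
    # containing spaces or parentheses cannot shift the field positions.
    rest=${1##*\) }
    # Split the remainder on whitespace; $1 is now field 3 (state),
    # so field 35 is positional parameter 33.
    set -- $rest
    printf 'state=%s wchan=0x%x\n' "$1" "${33}"
}

stat_wchan '546 (wis-streamer) D 1 453 453 0 -1 4194564 391 0 135 0 140098 232409 0 0 20 0 1 0 1094 0 0 4294967295 0 0 0 0 0 0 0 4100 27138 3223605768 0 0 17 0 0 0 0 0 0'
# prints: state=D wchan=0xc0245208
```

That address could then be looked up in the kernel's System.map (or /proc/kallsyms, if the kernel was built with it) to see which kernel function the task is sleeping in.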

Update upon update:

I have a feeling that a SysV IPC message queue or semaphore call (or some such) may be hanging. Our system is held together by inter-process message queues (at least 40% Not Invented Here, written by Elbonian Code Slaves as part of a horrible horrible SDK) which can trap the unwary. I have re-jigged a couple of semaphore get/release routines which I suspect were less than fully watertight (in fact probably only just squirrel-proof) and will keep an eye on things. Unfortunately it takes on average 12 hours of running on a very particular test setup to induce this failure.

Upvotes: 2

Views: 7065

Answers (1)

Armali

Reputation: 19375

From the Documentation for sysrq:

'w' - Dumps tasks that are in uninterruptable (blocked) state.


echo w >/proc/sysrq-trigger

shows extensive information about the blocked task(s) on the console (should also be viewable through dmesg); in particular the kernel stack trace is helpful for illuminating the issue.
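
If only one process is of interest, the per-process files under /proc can give a narrower view of the same thing. The following is only a sketch: /proc/<pid>/wchan and /proc/<pid>/stack are standard procfs entries, but stack needs a kernel built with CONFIG_STACKTRACE and may be missing on an older embedded kernel. The function name show_blocked and its optional second argument (an alternative proc root, there only so the function can be tried against a fake tree) are assumptions for illustration:

```shell
# Show where a task is sleeping in the kernel, using per-process procfs files.
# $1 = PID, $2 = proc root (defaults to /proc).
show_blocked() {
    pid=$1
    proc=${2:-/proc}
    for f in wchan stack; do
        echo "== $proc/$pid/$f =="
        if [ -r "$proc/$pid/$f" ]; then
            cat "$proc/$pid/$f"
            echo    # wchan is written without a trailing newline
        else
            echo "(not available on this kernel, or insufficient privileges)"
        fi
    done
}
```

Run as root with the PID of the blocked process, for example show_blocked 546. If wchan shows a symbol name rather than 0 or a raw address, that is the kernel function the task is waiting in.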

Upvotes: 12
