My Python application running on a 64-core Linux box normally runs without a problem. Then after some random length of time (around 0.5 to 1.5 days usually) I suddenly start getting frequent pauses/lockups of over 10 seconds! During these lockups the system CPU time (i.e. time in the kernel) can be over 90% (yes: 90% of all 64 cores, not of just one CPU).
My app is restarted often throughout the day. Restarting the app does not fix the problem. However, rebooting the machine does.
Question 1: What could cause 90% system CPU time for 10 seconds? All of the system CPU time is in my parent Python process, not in the child processes created through Python's multiprocessing or other processes. So that means something of the order of 60+ threads spending 10+ seconds in the kernel. I am not even sure if this is a Python issue or a Linux kernel issue.
Question 2: That a reboot fixes the problem must be a big clue to the cause. What Linux resource could be left exhausted across restarts of my app, but cleared by a reboot, that would make the system get stuck like this?
Below I will mention multiprocessing a lot. That's because the application runs in a cycle and multiprocessing is only used in one part of the cycle. The high CPU almost always happens immediately after all the multiprocessing calls finish. I'm not sure if this is a hint at the cause or a red herring.
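To give an idea of the structure, the multiprocessing step uses the standard pattern: a Pool created in the main thread, guarded by if __name__ == "__main__":, with the start method selectable (I've tried forkserver and spawn as well as the default). The sketch below is only illustrative; the worker function and the data are placeholders, not my actual code.

import multiprocessing

def process_item(item):
    # Placeholder for the real per-item work done in each child process.
    return item

def run_parallel_step(items, start_method="forkserver"):
    # The start method can be swapped between fork (the default), forkserver and spawn.
    ctx = multiprocessing.get_context(start_method)
    with ctx.Pool(processes=8) as pool:
        return pool.map(process_item, items)

if __name__ == "__main__":
    # The Pool is only ever created here, in the main thread.
    results = run_parallel_step(list(range(100)))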
What I've tried / investigated so far:
- I use psutil to log the process and system CPU stats every 0.5 seconds (a rough sketch of that kind of logging follows this list). I have independently confirmed what it's reporting with top.
- The forkserver and spawn multiprocessing contexts: I've tried them, no difference.
- lsof output listing all resources here.
- I've run strace on just one thread that is running slow (it can't be run across all threads because it slows the app far too much). Below is what I got, which doesn't tell me much.
- ltrace does not work because you can't use -p on a thread ID. Even just running it on the main thread (no -f) makes the app so slow that the problem doesn't show up.

Environment / notes:
- uname -a: Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux (although this kernel update was only applied today)
- The forkserver and spawn start methods mentioned above, which I've tried.
- The app uses ctypes to access a C .so library provided by a camera manufacturer.
- The multiprocessing Pool is created in code guarded by if __name__ == "__main__": and in the main thread (as in the sketch above).
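For reference, the CPU logging mentioned above amounts to something like the following (a simplified sketch using psutil's documented API, not my exact logging code):

import time
import psutil

proc = psutil.Process()  # the parent Python process

while True:
    # System-wide user/system split since the last call; the first iteration
    # reports 0.0 until a baseline interval has elapsed.
    sys_cpu = psutil.cpu_times_percent(interval=None)
    # CPU usage and cumulative CPU times for this process only.
    proc_pct = proc.cpu_percent(interval=None)
    proc_times = proc.cpu_times()
    print("%.3f sys=%.1f%% user=%.1f%% | proc=%.1f%% (utime=%.1f stime=%.1f)" % (
        time.time(), sys_cpu.system, sys_cpu.user,
        proc_pct, proc_times.user, proc_times.system))
    time.sleep(0.5)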
and in the main threadA few times I've managed to strace a thread that ran at 100% 'system' CPU. But only once have I gotten anything meaningful out of it. See below the call at 10:24:12.446614 that takes 1.4 seconds. Given it's the same ID (0x7f05e4d1072c) you see in most the other calls my guess would be this is Python's GIL synchronisation. Does this guess make sense? If so, then the question is why does the wait take 1.4 seconds? Is someone not releasing the GIL?
10:24:12.375456 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000823>
10:24:12.377076 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002419>
10:24:12.379588 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.001898>
10:24:12.382324 sched_yield() = 0 <0.000186>
10:24:12.382596 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.004023>
10:24:12.387029 sched_yield() = 0 <0.000175>
10:24:12.387279 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.054431>
10:24:12.442018 sched_yield() = 0 <0.000050>
10:24:12.442157 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.003902>
10:24:12.446168 futex(0x7f05e4d1022c, FUTEX_WAKE, 1) = 1 <0.000052>
10:24:12.446316 futex(0x7f05e4d11cac, FUTEX_WAKE, 1) = 1 <0.000056>
10:24:12.446614 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <1.439739>
10:24:13.886513 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002381>
10:24:13.889079 sched_yield() = 0 <0.000016>
10:24:13.889135 sched_yield() = 0 <0.000049>
10:24:13.889244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.032761>
10:24:13.922147 sched_yield() = 0 <0.000020>
10:24:13.922285 sched_yield() = 0 <0.000104>
10:24:13.923628 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002320>
10:24:13.926090 sched_yield() = 0 <0.000018>
10:24:13.926244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000265>
10:24:13.926667 sched_yield() = 0 <0.000027>
10:24:13.926775 sched_yield() = 0 <0.000042>
10:24:13.926964 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000117>
10:24:13.927241 futex(0x7f05e4d110ac, FUTEX_WAKE, 1) = 1 <0.000099>
10:24:13.927455 futex(0x7f05e4d11d2c, FUTEX_WAKE, 1) = 1 <0.000186>
10:24:13.931318 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000678>
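One way to probe the GIL theory (not something I've done yet) would be a watchdog thread that measures the gaps between its own wakeups and, after an unusually long gap, dumps every thread's Python-level stack via sys._current_frames(). A big gap means the watchdog itself couldn't run, i.e. Python threads were not getting the GIL/CPU during the pause; the stack dump is best effort because it only happens after the stall ends. This is only a sketch, not code from my app:

import sys
import threading
import time
import traceback

def watchdog(interval=0.5, stall_threshold=2.0):
    last = time.time()
    while True:
        time.sleep(interval)
        now = time.time()
        gap = now - last
        if gap > stall_threshold:
            # This thread was starved for `gap` seconds: no Python bytecode ran
            # here during that time, so the GIL/CPU was unavailable to it.
            print("watchdog starved for %.1fs; dumping all thread stacks" % gap)
            for tid, frame in sys._current_frames().items():
                print("--- thread %d ---" % tid)
                traceback.print_stack(frame)
        last = now

threading.Thread(target=watchdog, name="watchdog", daemon=True).start()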
Upvotes: 19
Views: 1159
I've managed to get a thread dump from gdb right at the point where 40+ threads are showing 100% 'system' CPU time. Here's the backtrace, which is the same for every one of those threads:
#0 0x00007fffebe9b407 in cv::ThresholdRunner::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1 0x00007fffecfe44a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2 0x00007fffe967496a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3 0x00007fffe96705a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#4 0x00007fffe966fc6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#5 0x00007fffe966d65f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#6 0x00007fffe966d859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#7 0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6
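For context, cv::ThresholdRunner is the worker functor OpenCV uses to parallelise its threshold operation over TBB, so all of those threads are inside what is, from Python's point of view, a single ordinary call along these lines (illustrative only, not my exact code):

import cv2
import numpy as np

# A single thresholding call; with a TBB-enabled OpenCV build the work is
# split across internal worker threads (the ThresholdRunner frames above).
img = np.random.randint(0, 256, size=(4096, 4096), dtype=np.uint8)
_, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)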
My original question put Python and Linux front and center, but the issue appears to lie with TBB and/or OpenCV. Since OpenCV with TBB is so widely used, I presume the problem must also involve some interplay with my specific environment, perhaps the fact that it's a 64-core machine.
I have recompiled OpenCV with TBB turned off and the problem has not reappeared so far. But my app now runs slower.
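As an aside, OpenCV's internal parallelism can also be capped at runtime with cv2.setNumThreads instead of rebuilding. I haven't verified whether that avoids the problem with a TBB build, so treat this only as a sketch:

import cv2

# Passing 0 tells OpenCV to disable its threading optimisations; a small
# positive number caps the internal worker pool instead.
cv2.setNumThreads(0)
print("OpenCV threads now:", cv2.getNumThreads())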
I have posted this as a bug to OpenCV and will update this answer with anything that comes from that.
Upvotes: 0