Debugging python segmentation faults in garbage collection

Question

I'm faced with segmentation faults (SIGSEGV) which occur during garbage collection in cPython. I've also had one instance where a process was killed with SIGBUS. My own code is mostly python and a little bit of very high level Cython. I'm most certainly not - deliberately and explicitly - fooling around with pointers or writing to memory directly.

Example backtraces from coredumps (extracted with gdb):

#0  0x00007f8b0ac29471 in subtype_dealloc (self=)
    at /usr/src/debug/Python-3.5.1/Objects/typeobject.c:1182
#1  0x00007f8b0abe8947 in method_dealloc (im=0x7f8afc883e08) at /usr/src/debug/Python-3.5.1/Objects/classobject.c:198
#2  0x00007f8b0ac285a9 in clear_slots (type=type@entry=0x560219f0fa88, 
    self=self@entry=) at /usr/src/debug/Python-3.5.1/Objects/typeobject.c:1044
#3  0x00007f8b0ac29506 in subtype_dealloc (self=)
    at /usr/src/debug/Python-3.5.1/Objects/typeobject.c:1200
#4  0x00007f8b0ac8caad in PyEval_EvalFrameEx (
    f=f@entry=Frame 0x56021a01ff08, for file /usr/lib64/python3.5/asyncio/base_events.py, line 1239, in _run_once (self=<_UnixSelectorEventLoop(_coroutine_wrapper_set=False, _current_handle=None, _ready=, _closed=False, _task_factory=None, _selector=) at remote 0x7f8afc868748>, _epoll=, _fd_to_key={4: , 6: , 7: , 8: , 9: , 10: , 11: , 12: , 13: , 14: , 15: , 16: , 17: , 18:



During the sweep (I think):

#1  0x00007f0a0624c63f in visit_decref (op=, data=)
    at /usr/src/debug/Python-3.5.1/Modules/gcmodule.c:373
#2  0x00007f0a0624b813 in subtract_refs (containers=)
    at /usr/src/debug/Python-3.5.1/Modules/gcmodule.c:398
#3  collect (generation=generation@entry=2, n_collected=n_collected@entry=0x7f09d75708c8, 
    n_uncollectable=n_uncollectable@entry=0x7f09d75708d0, nofail=nofail@entry=0)
    at /usr/src/debug/Python-3.5.1/Modules/gcmodule.c:951


And also once during malloc:

#0  _PyObject_Malloc (ctx=0x0, nbytes=56) at /usr/src/debug/Python-3.4.3/Objects/obmalloc.c:1159
1159                if ((pool->freeblock = *(block **)bp) != NULL) {
(gdb) bt
#0  _PyObject_Malloc (ctx=0x0, nbytes=56) at /usr/src/debug/Python-3.4.3/Objects/obmalloc.c:1159


And the SIGBUS trace (looks like it happened while cPython was recovering from another error):

#0  malloc_printerr (ar_ptr=0x100101f0100101a, ptr=0x7f067955da60 , str=0x7f06785a2b8c "free(): invalid size", action=3) at malloc.c:5009
5009        set_arena_corrupt (ar_ptr);
(gdb) bt
#0  malloc_printerr (ar_ptr=0x100101f0100101a, ptr=0x7f067955da60 , str=0x7f06785a2b8c "free(): invalid size", action=3) at malloc.c:5009
#1  _int_free (av=0x100101f0100101a, p=, have_lock=0) at malloc.c:3842
Python Exception  Type does not have a target.:




These back traces are from Fedora 24 with Python 3.5.1 and some from Centos 7 with Python 3.4.3. So I'm ruling out issues like:


bad memory (could be, but quite a coincidence that my laptop and a (virtual) server exhibit the same issue and show no otherwise work fine).
Issues in the OS or cPython or the combination thereof.


So - as usual - it must be something in my own code. It's a mixture of threaded code for some concurrency of (computational) 'tasks' and a thread running an asyncio loop. The code base has other 'workloads' which run fine. Differences in the code which is 'serving' this 'workload' which stand out to me are that I use (threading.RLock) locks quite a bit to serialize some requests and I'm serializing and writing to disk.

Any suggestions on how to find a root cause would be highly appreciated!

Things I've tried:


Stripping down the external dependencies used to the bare minimum (which was cloudpickle): no difference
Dumping counts of all object types as seen by the garbage collector before and after explicit collection and seeing if there were types of objects tracked just before a crash and weren't in the 'successful' garbage collections: there weren't.
Going through core dumps with GDB: not really consistent regarding the C nor the Python backtraces.
Running with MALLOC_CHECK_=2: I don't see any error messages other than that a process has exited with exit code -11.




Edit 1

Got something: https://gist.github.com/frensjan/dc9bc784229fec844403c9d9528ada66 

Most notably:

==19072== Invalid write of size 8
==19072==    at 0x47A3B7: subtype_dealloc (typeobject.c:1157)
==19072==    by 0x4E0696: PyEval_EvalFrameEx (ceval.c:1388)
==19072==    by 0x58917B: gen_send_ex (genobject.c:104)
==19072==    by 0x58917B: _PyGen_Send (genobject.c:158)
...
==19072==    by 0x4E2A6A: call_function (ceval.c:4262)
==19072==    by 0x4E2A6A: PyEval_EvalFrameEx (ceval.c:2838)
==19072==  Address 0x8 is not stack'd, malloc'd or (recently) free'd


and

==19072== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==19072==  Access not within mapped region at address 0x8
==19072==    at 0x47A3B7: subtype_dealloc (typeobject.c:1157)
==19072==    by 0x4E0696: PyEval_EvalFrameEx (ceval.c:1388)
==19072==    by 0x58917B: gen_send_ex (genobject.c:104)
==19072==    by 0x58917B: _PyGen_Send (genobject.c:158)


But the errors valgrind found point to the same locations as the coredumps I got earlier, nowhere near my code ... not really sure what to do with this. 

Environment: Python 3.4.5+ built with ./configure --without-pymalloc on Centos 7. Ran python with valgrind --tool=memcheck --dsymutil=yes --track-origins=yes --show-leak-kinds=all --trace-children=yes python ...

Any help is much appreciated!

Frens Jan · Accepted Answer

I am rather confident that issue I ran across is due to issue 26617 in CPython which is fixed in 7bfa6f2.

I've check this by reproducing the issue (running into SIGSEGV) with a build of the commit just before 7bfa6f2 and not being able to reproduce it with a build of commit 7bfa6f2.

Debugging python segmentation faults in garbage collection

Edit 1

Answers (1)

Related Questions