Buffer corruption with correctly managed heap

Question

I have a heisenbug that occurs so infrequently that I cannot reproduce it in any environment. When it does occur it fails spectacularly, and I have no idea how to diagnose it.

The bug is related to memory usage, but the corruption doesn't fit any of the four usual categories of memory corruption.

Tooling shows that it is not:

Uninitialized memory (it was previously allocated)
Using non-owned memory (it was allocated by the thread and owned by thread)
Buffer Overflow (Boundary checks pass)
Faulty heap memory management (this isn't a leak, all memory is freed, and set null defensively)

I say this with some confidence because, while I cannot reproduce it, logging and tooling in the higher-transaction environments indicate that none of the above happen.

I'm compiling with gcc as C11, with no optimization, -Wall, and otherwise minimal flags.

ASAN, Electric Fence, Helgrind, Memcheck, and cppcheck find no problems.

Heap management appears to work well: we use a pool allocator with boundary checks and a corruption sentinel.

There are absolutely no unit tests.

The issue is primarily seen when, very rarely, an array is corrupted and invalid boundaries are set. There are only 50 items, but the count of items gets corrupted and we end up with a value < 0 or > 50. Core dumps show this. By determining where this array bound comes from and verifying the correct value we can prevent this issue, but then the problem migrates to another location. Since this only affects a single customer and a single transaction type, that suggests to me that it is related to this customer or transaction, but that line of investigation has borne no fruit.
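
To make the kind of check described above concrete, here is a minimal sketch. It assumes the count lives in a struct next to the array; `struct item_table`, `GUARD_WORD`, and `ITEM_TABLE_CHECK` are hypothetical names for illustration, not the real code. The count is bracketed by guard words, so a stray write that runs across it is more likely to leave evidence, and the macro records which call site first saw the bad value.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ITEMS  50              /* from the post: there are only 50 items */
#define GUARD_WORD 0xDEADC0DEu     /* arbitrary sentinel pattern (assumption) */

/* Hypothetical stand-in for the real structure. */
struct item_table {
    uint32_t guard_before;         /* sentinel just before the count */
    int32_t  count;                /* valid range is 0..MAX_ITEMS */
    uint32_t guard_after;          /* sentinel just after the count */
    int32_t  items[MAX_ITEMS];
};

static void item_table_init(struct item_table *t)
{
    t->guard_before = GUARD_WORD;
    t->count = 0;
    t->guard_after = GUARD_WORD;
}

/* Returns true if the table looks sane; otherwise logs the failing call site. */
static bool item_table_check(const struct item_table *t,
                             const char *file, int line)
{
    bool guards_ok = t->guard_before == GUARD_WORD &&
                     t->guard_after == GUARD_WORD;
    bool count_ok = t->count >= 0 && t->count <= MAX_ITEMS;

    if (!guards_ok || !count_ok) {
        fprintf(stderr, "%s:%d: item table corrupt (count=%d, guards %s)\n",
                file, line, (int)t->count,
                guards_ok ? "intact" : "overwritten");
    }
    return guards_ok && count_ok;
}

/* Wrapper so each check reports where it was called from. */
#define ITEM_TABLE_CHECK(t) item_table_check((t), __FILE__, __LINE__)
```

If the guards are intact but the count is out of range, that would hint at a targeted write to the count (for example through a stale pointer) rather than a linear overrun. That is only a hint, not proof.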

Due to how infrequent this occurs I cannot rule out:

Corruption by another thread
Some data race condition
Thread race condition
A programming error writing to a place it shouldn't

I am unable to run any of the above tooling (ASAN, Electric Fence, ...) in an environment that simulates the conditions that cause this, and I cannot reproduce it in any environment where I can run these tools.

My only thoughts would be to:

Create deep copies or serialize these objects and splice checks throughout the code base. (messy, might be impossible due to memory constraints)
Ignore the problem (it disproportionately affects a single customer, so I cannot do this)
Keep playing ping pong and make the codebase even more ugly with all these error checks looking for corruption.
Rewrite it all in $Language (not really an option)
Try a new pool allocator or arena allocator to see if there's an unknown bug in our custom one.
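
As a cheaper variant of the first idea, one could store a checksum next to the object instead of a deep copy. This is a minimal sketch under assumptions: `struct txn_state`, `struct sealed_txn`, `txn_seal`, and `txn_verify` are hypothetical names, and the payload is assumed to contain no padding bytes (padding has indeterminate contents and would make the hash unreliable). The hash is FNV-1a, a small non-cryptographic function, used here only to notice stray writes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over a byte range. */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;             /* FNV prime */
    }
    return h;
}

/* Hypothetical payload: only int32_t members, so no internal padding. */
struct txn_state {
    int32_t count;
    int32_t items[50];
};

/* Payload plus the checksum recorded at the last legitimate write. */
struct sealed_txn {
    struct txn_state state;
    uint64_t seal;
};

/* Call after every intentional modification of s->state. */
static void txn_seal(struct sealed_txn *s)
{
    s->seal = fnv1a64(&s->state, sizeof s->state);
}

/* Call before trusting s->state; false means it changed since the last seal. */
static bool txn_verify(const struct sealed_txn *s)
{
    return s->seal == fnv1a64(&s->state, sizeof s->state);
}
```

This costs 8 bytes per object rather than a full copy, but it only shows that something changed between a seal and a verify, not what wrote to it. Narrowing down the writer would still mean placing verify calls at more points in the code.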

I'm looking for novel approaches that I haven't considered: ways to automate this, or better tooling for this type of problem. How do you validate that an object hasn't changed behind your back?
