will.mont

Reputation: 81

Buffer corruption with correctly managed heap

I have a heisenbug that occurs so infrequently that it can't be reproduced in any environment. It fails spectacularly, and I have no idea how to diagnose it.

The bug is related to memory usage, but the corruption doesn't fit any of the four usual categories of memory corruption.

Tooling shows that it is not:

  1. Uninitialized memory (it was previously allocated)
  2. Using non-owned memory (it was allocated by the thread and is owned by the thread)
  3. Buffer Overflow (Boundary checks pass)
  4. Faulty heap memory management (this isn't a leak, all memory is freed, and set null defensively)

I say this with some confidence because while I cannot replicate it, logging and tooling in the higher transaction environments indicate the above do not happen.

I'm compiling with gcc as C11, with no optimization, -Wall, and other minimal flags.

ASAN, Electric Fence, Helgrind, Memcheck, and cppcheck find no problems.

Heap management appears to work well with a pool allocator, boundary checks, corruption sentinel.

There are absolutely no unit tests.

The issue shows up, very rarely, as a corrupted array whose bounds have been set to invalid values. There are only 50 items, but the item count gets corrupted and we end up with a count < 0 or > 50. Core dumps show this. By determining where the array bound comes from and verifying the correct value we can prevent that particular failure, but then the problem migrates to another location. Since this only affects a single customer and a single transaction type, that indicates to me something related to this customer or transaction. But that tree has borne no fruit.

Due to how infrequent this occurs I cannot rule out:

  1. Corruption by another thread
  2. Some data race condition
  3. Thread race condition
  4. A programming error writing to a place it shouldn't.

I am unable to run any of the above tooling (ASAN, Electric Fence, ...) in an environment that simulates the conditions that cause this, and I cannot replicate it in any environment where I can run these tools.

My only thoughts would be to:

  1. Create deep copies or serialize these objects and splice checks throughout the code base. (messy, might be impossible due to memory constraints)
  2. Ignore the problem (it disproportionately affects a single customer, so I cannot do this)
  3. Keep playing ping pong and make the codebase even more ugly with all these error checks looking for corruption.
  4. Rewrite it all in $Language (not really an option)
  5. Try a new pool allocator or arena allocator to see if there's an unknown bug in our custom one.
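
Option 1 could be sketched with a checksum instead of a full copy, which keeps the memory cost to a few bytes per object. This is only a sketch under assumptions: `struct item_list`, `guard_seal`, and `guard_check` are made-up names, the real object layout is not shown here, and the hash is plain 32-bit FNV-1a.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical object; stands in for the real structure, which is not shown. */
struct item_list {
    int32_t count;        /* valid range: 0..50 */
    int32_t items[50];
};

/* The checksum covers raw bytes, so the struct must not contain padding
   (or must be memset to zero before use). */
static_assert(sizeof(struct item_list) == 51 * sizeof(int32_t),
        "item_list is expected to have no padding");

/* Checksum stored outside the object, e.g. next to the owning pointer. */
struct guard {
    uint32_t sum;
};

/* 32-bit FNV-1a over a byte range. */
static uint32_t
fnv1a(const void *p, size_t n)
{
    const unsigned char *b = p;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u;
    }
    return h;
}

/* Call after every legitimate modification of the object. */
void
guard_seal(struct guard *g, const struct item_list *l)
{
    g->sum = fnv1a(l, sizeof *l);
}

/* Returns 1 if the object is unchanged since the last seal, 0 otherwise. */
int
guard_check(const struct guard *g, const struct item_list *l)
{
    return g->sum == fnv1a(l, sizeof *l);
}
```

Sealing after each legitimate write and checking at function entry narrows the window in which the corruption happened, but it does not say who wrote the bad value.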

I'm looking for novel approaches that I haven't considered: ways to automate this, or better tooling for this type of problem. How do you validate that an object hasn't changed behind your back?

Upvotes: 1

Views: 322

Answers (1)

jxh

Reputation: 70392

Although this question is likely to be closed as off-topic, a general-purpose tool that can help you track down the root cause in the future is an in-memory ring buffer that records critical events. It differs from regular logging in that it writes only to memory, and thus has very low latency. If you dedicate enough memory to this log, you should be able to inspect it after the customer's next crash and get a better idea of the events that led up to the corruption.

A very basic implementation would be:

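#include <assert.h>  /* static_assert (C11) */
#include <stdarg.h>  /* va_list, for a printf-like wrapper */
#include <stdint.h>
#include <stdio.h>   /* vsnprintf, FILE, fwrite */
#include <string.h>  /* memcpy */

/* LR_TAPE_SIZE and LR_LOG_MAX are assumed to be #defined before this point. */
static_assert(0 == (LR_TAPE_SIZE & (LR_TAPE_SIZE-1)),
        "LR_TAPE_SIZE must be a power of 2");
static_assert(LR_TAPE_SIZE > (LR_LOG_MAX + 1),
        "LR_TAPE_SIZE must be larger than LR_LOG_MAX");

struct lr_tape {
    uint32_t wrap :  1;   /* set once the tape has been filled at least once */
    uint32_t head : 31;   /* running byte count; next write goes to head % LR_TAPE_SIZE */
    char tape[LR_TAPE_SIZE];
};

/* Append sz bytes, wrapping to the start of the tape when the end is reached.
   sz must not exceed LR_TAPE_SIZE. Not thread-safe. */
int
lr_write(struct lr_tape *lr, const void *buf, uint32_t sz)
{
    uint32_t pos = lr->head % LR_TAPE_SIZE;
    uint32_t cnt = LR_TAPE_SIZE - pos;   /* bytes left before the end of the tape */
    memcpy(&lr->tape[pos], buf, (cnt < sz) ? cnt : sz);
    /* Cast first: arithmetic on void * is a GNU extension, not standard C. */
    if (cnt < sz) memcpy(&lr->tape[0], (const char *)buf + cnt, sz - cnt);
    lr->head += sz;
    lr->wrap = lr->wrap || (lr->head >= LR_TAPE_SIZE);
    return sz;
}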

You can then implement a simple printf-like wrapper for it.

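/* Format one record and append it to the tape. Records longer than
   LR_LOG_MAX are truncated and end in "...". A newline is added if missing. */
int
lr_log(struct lr_tape *lr, const char *fmt, ...)
{
    char buf[LR_LOG_MAX + 1];   /* one spare byte for the appended newline */
    va_list ap;
    int r;
    va_start(ap, fmt);
    r = vsnprintf(buf, LR_LOG_MAX, fmt, ap);
    va_end(ap);
    if (r <= 0) return r;       /* formatting error or empty record */
    if (r >= LR_LOG_MAX) {
        /* vsnprintf returns the untruncated length, so this means truncation. */
        r = LR_LOG_MAX;
        buf[r-3] = buf[r-2] = buf[r-1] = '.';
    }
    if (buf[r-1] != '\n') buf[r++] = '\n';
    return lr_write(lr, buf, r);
}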

And a way to emit it:

/* Write the tape contents to out, oldest bytes first. */
void
lr_output(struct lr_tape *lr, FILE *out)
{
    uint32_t pos = lr->head % LR_TAPE_SIZE;
    uint32_t cnt = LR_TAPE_SIZE - pos;
    if (lr->head == 0) return;
    if (lr->wrap) {
        /* The oldest surviving data usually starts mid-record. */
        fwrite("...", 3, 1, out);
        fwrite(&lr->tape[pos], cnt, 1, out);
    }
    fwrite(lr->tape, pos, 1, out);
}
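
To tie this to the corrupted item count in the question, here is a self-contained sketch. It repeats condensed copies of the routines above so it compiles on its own, and adds one hypothetical helper, `lr_checkpoint`, that records the count seen at a named point in the code. The sizes, `MAX_ITEMS`, and the record format are assumptions chosen for illustration, not values taken from the real code base.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed sizes, tiny on purpose so that wrap-around is easy to trigger. */
#define LR_TAPE_SIZE 64u
#define LR_LOG_MAX   32
#define MAX_ITEMS    50   /* the limit mentioned in the question */

static_assert(0 == (LR_TAPE_SIZE & (LR_TAPE_SIZE - 1)), "must be a power of 2");
static_assert(LR_TAPE_SIZE > (LR_LOG_MAX + 1), "tape too small");

struct lr_tape {
    uint32_t wrap :  1;
    uint32_t head : 31;
    char tape[LR_TAPE_SIZE];
};

/* Condensed copy of lr_write: append bytes, wrapping at the end of the tape. */
int lr_write(struct lr_tape *lr, const void *buf, uint32_t sz)
{
    uint32_t pos = lr->head % LR_TAPE_SIZE;
    uint32_t cnt = LR_TAPE_SIZE - pos;
    memcpy(&lr->tape[pos], buf, (cnt < sz) ? cnt : sz);
    if (cnt < sz) memcpy(&lr->tape[0], (const char *)buf + cnt, sz - cnt);
    lr->head += sz;
    lr->wrap = lr->wrap || (lr->head >= LR_TAPE_SIZE);
    return (int)sz;
}

/* Condensed copy of lr_log: format, truncate with "...", ensure a newline. */
int lr_log(struct lr_tape *lr, const char *fmt, ...)
{
    char buf[LR_LOG_MAX + 1];
    va_list ap;
    int r;
    va_start(ap, fmt);
    r = vsnprintf(buf, LR_LOG_MAX, fmt, ap);
    va_end(ap);
    if (r <= 0) return r;
    if (r >= LR_LOG_MAX) {
        r = LR_LOG_MAX;
        buf[r-3] = buf[r-2] = buf[r-1] = '.';
    }
    if (buf[r-1] != '\n') buf[r++] = '\n';
    return lr_write(lr, buf, (uint32_t)r);
}

/* Condensed copy of lr_output: dump the tape, oldest bytes first. */
void lr_output(struct lr_tape *lr, FILE *out)
{
    uint32_t pos = lr->head % LR_TAPE_SIZE;
    if (lr->head == 0) return;
    if (lr->wrap) {
        fwrite("...", 3, 1, out);
        fwrite(&lr->tape[pos], LR_TAPE_SIZE - pos, 1, out);
    }
    fwrite(lr->tape, pos, 1, out);
}

/* Hypothetical helper: record the count observed at a named checkpoint.
   Returns 0 if the count is within 0..MAX_ITEMS, -1 otherwise. */
int lr_checkpoint(struct lr_tape *lr, int count, const char *where)
{
    int ok = (count >= 0 && count <= MAX_ITEMS);
    lr_log(lr, "%s: count=%d%s", where, count, ok ? "" : " BAD");
    return ok ? 0 : -1;
}
```

A caller would keep one zero-initialized `struct lr_tape`, call `lr_checkpoint` at the points of interest, and on a -1 return call `lr_output(&tape, stderr)` before aborting. If the tape lives in static storage it is also present in the core dump. With concurrent callers the tape needs a lock or one tape per thread; the sketch does not address that.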

Upvotes: 1
