Reputation: 81
I have a heisenbug that occurs so infrequently that I cannot reproduce it in any environment. When it does occur it fails spectacularly, and I have no idea how to diagnose it.
The bug is related to memory usage. The corruption doesn't fit into the defined four categories of corruption.
Tooling shows that it is not:
I say this with some confidence because while I cannot replicate it, logging and tooling in the higher transaction environments indicate the above do not happen.
I'm compiling with gcc as C11, with no optimization, -Wall, and otherwise minimal flags.
ASan, Electric Fence, Helgrind, Memcheck, and cppcheck find no problems.
Heap management appears to work well, with a pool allocator, boundary checks, and a corruption sentinel.
There are absolutely no unit tests.
The issue is primarily seen when, very rarely, an array is corrupted and invalid boundaries are set: there are only 50 items, but the count of items gets corrupted and we end up with a value < 0 or > 50. Core dumps show this. By determining where this array bound comes from and verifying the correct value we can prevent this particular failure, but then the problem migrates to another location. Since this only affects a single customer and a single transaction type, that suggests to me that the cause is related to this customer or transaction. But that tree has borne no fruit.
Due to how infrequent this occurs I cannot rule out:
I am unable to run any of the above tooling (ASan, Electric Fence, ...) in an environment that simulates the conditions that cause this, and I cannot replicate the bug in any environment where I can run these tools.
My only thoughts would be to:
I'm looking for novel approaches that I haven't considered: ways to automate this, or better tooling for this type of problem. How do you validate that an object hasn't changed behind your back?
Upvotes: 1
Views: 322
Reputation: 70392
Although this question is likely to be closed as off-topic, a general-purpose technique that can help track down the root cause in the future is an in-memory ring buffer that records critical events. It differs from regular logging in that it writes only to memory, and thus has very low latency. If you dedicate enough memory to this log, you should be able to inspect it after the customer's next crash and get a better idea of the events that led up to the corruption.
A very basic implementation would be:
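    #include <assert.h>   /* static_assert */
    #include <stdarg.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Example sizes only (assumed values); tune to your memory budget. */
    #define LR_TAPE_SIZE (1u << 16)
    #define LR_LOG_MAX   255

    static_assert(0 == (LR_TAPE_SIZE & (LR_TAPE_SIZE-1)),
                  "LR_TAPE_SIZE must be a power of 2");
    static_assert(LR_TAPE_SIZE > (LR_LOG_MAX + 1),
                  "LR_TAPE_SIZE must be larger than LR_LOG_MAX");

    struct lr_tape {
        uint32_t wrap : 1;   /* set once the tape has wrapped at least once */
        uint32_t head : 31;  /* running write offset; position is head % LR_TAPE_SIZE */
        char tape[LR_TAPE_SIZE];
    };

    int
    lr_write(struct lr_tape *lr, const void *buf, uint32_t sz)
    {
        const char *src = buf;  /* arithmetic on void * is a GNU extension */
        uint32_t pos = lr->head % LR_TAPE_SIZE;
        uint32_t cnt = LR_TAPE_SIZE - pos;  /* bytes left before the end of the tape */

        memcpy(&lr->tape[pos], src, (cnt < sz) ? cnt : sz);
        if (cnt < sz) memcpy(&lr->tape[0], src + cnt, sz - cnt);
        lr->head += sz;
        lr->wrap = lr->wrap || (lr->head >= LR_TAPE_SIZE);
        return sz;
    }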
You can then implement a simple printf-like wrapper for it.
    int
    lr_log(struct lr_tape *lr, const char *fmt, ...)
    {
        char buf[LR_LOG_MAX + 1];  /* one spare byte for an appended newline */
        va_list ap;
        int r, p;

        va_start(ap, fmt);
        r = vsnprintf(buf, LR_LOG_MAX, fmt, ap);
        va_end(ap);
        if (r <= 0) return r;
        if (r >= LR_LOG_MAX) {
            /* Output was truncated: mark it by ending the message with "..." */
            r = LR_LOG_MAX;
            buf[r-3] = buf[r-2] = buf[r-1] = '.';
        }
        if (buf[r-1] != '\n') buf[r++] = '\n';
        return lr_write(lr, buf, r);
    }
And a way to emit it:
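    void
    lr_output(struct lr_tape *lr, FILE *out)
    {
        uint32_t pos = lr->head % LR_TAPE_SIZE;
        uint32_t cnt = LR_TAPE_SIZE - pos;

        /* Nothing logged yet. (head may also read 0 after its 31-bit counter rolls over.) */
        if (lr->head == 0 && !lr->wrap) return;
        if (lr->wrap) {
            /* Oldest data starts at pos; "..." marks that earlier entries were overwritten. */
            fwrite("...", 3, 1, out);
            fwrite(&lr->tape[pos], cnt, 1, out);
        }
        fwrite(lr->tape, pos, 1, out);
    }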
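As a rough illustration of how such a tape might be wired into the situation described in the question (an item count that must stay between 0 and 50), here is a compact, self-contained sketch. Everything in it is hypothetical and chosen for illustration only: the names `ev_log`, `ev_dump`, `table_check`, `TABLE_CHECK`, and `struct item_table`, as well as the sizes, are assumptions and not part of the asker's code. It uses its own tiny ring buffer rather than the functions above so that it compiles on its own.

```c
#include <assert.h>   /* static_assert */
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names and sizes for this sketch only. */
#define EV_TAPE_SIZE 4096u   /* must be a power of 2 */
#define EV_MSG_MAX   128
#define ITEM_MAX     50      /* "only 50 items" from the question */

static_assert((EV_TAPE_SIZE & (EV_TAPE_SIZE - 1)) == 0,
              "EV_TAPE_SIZE must be a power of 2");

static char     ev_tape[EV_TAPE_SIZE];
static uint64_t ev_head;     /* total bytes written so far */

/* Append one formatted event to the in-memory tape, wrapping as needed. */
static void ev_log(const char *fmt, ...)
{
    char buf[EV_MSG_MAX];
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);
    if (n <= 0) return;
    if (n >= (int)sizeof buf) n = (int)sizeof buf - 1;  /* message was truncated */
    for (int i = 0; i < n; i++)
        ev_tape[(ev_head + (uint64_t)i) % EV_TAPE_SIZE] = buf[i];
    ev_head += (uint64_t)n;
}

/* Write the tape to `out`, oldest bytes first. */
static void ev_dump(FILE *out)
{
    size_t pos = (size_t)(ev_head % EV_TAPE_SIZE);

    if (ev_head >= EV_TAPE_SIZE)           /* wrapped: oldest data starts at pos */
        fwrite(&ev_tape[pos], 1, EV_TAPE_SIZE - pos, out);
    fwrite(ev_tape, 1, pos, out);
}

/* Hypothetical stand-in for the structure whose count gets corrupted. */
struct item_table {
    int count;               /* invariant: 0 <= count <= ITEM_MAX */
    int items[ITEM_MAX];
};

/* Record the count at a call site. Returns 1 if it is sane; otherwise
 * records the bad value, dumps the tape to stderr, and returns 0. */
static int table_check(const struct item_table *t, const char *where)
{
    if (t->count >= 0 && t->count <= ITEM_MAX) {
        ev_log("ok  %s count=%d\n", where, t->count);
        return 1;
    }
    ev_log("BAD %s count=%d\n", where, t->count);
    ev_dump(stderr);
    return 0;
}

/* Tags each check with the name of the calling function. */
#define TABLE_CHECK(t) table_check((t), __func__)
```

Sprinkling `TABLE_CHECK(&table)` at the entry and exit of the functions that touch the table narrows the window in which the count went bad: the last "ok" line before the first "BAD" line brackets the corruption. Because the tape lives in ordinary static memory, it should also be present in a core dump, so it can be inspected there even if the process dies before `ev_dump` runs.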
Upvotes: 1