Tomislav Jonjic

Reputation: 206

Debugging crashes in production environments

First, I should give you a bit of context. The program in question is a fairly typical server application implemented in C++. Across the project, as well as in all of the underlying libraries, error management is based on C++ exceptions.

My question pertains to dealing with unrecoverable errors and/or programmer errors (the loose equivalent of "unchecked" Java exceptions, for want of a better parallel). I am especially interested in common practices for dealing with such conditions in production environments.

For production environments in particular, two conflicting goals stand out in the presence of the above class of errors: ease of debugging and availability (in the sense of operational performance). Each of these suggests in turn a specific strategy:

  • ease of debugging suggests aborting as soon as the error is detected, so that a core dump captures the state for post-mortem analysis
  • availability suggests catching the error in a top-level handler, logging it, and carrying on with the next request

So I end up with two half-baked solutions; I would like a compromise between service availability and debugging facilities. What am I missing?

Note: I have tagged the question as C++-specific, as I am interested in solutions and idiosyncrasies that apply to it in particular; nonetheless, I am aware there will be considerable overlap with other languages/environments.

Upvotes: 5

Views: 1624

Answers (2)

Matthieu M.

Reputation: 299999

Disclaimer: Much like the OP, I write code for servers, so this entire answer is focused on that specific use case. The strategy for embedded software or deployed applications is probably widely different; I have no idea.

First of all, there are two important (and rather different) aspects to this question:

  • Easing investigation (as much as possible)
  • Ensuring recovery

Let us treat both separately, as dividing is conquering. And let's start with the tougher bit.


Ensuring Recovery

The main issue with the C++/Java style of try/catch is that it is extremely easy to corrupt your environment, because code inside a try block (and inside a catch block) can mutate state outside its own scope; when an exception escapes, that state may be left half-modified. Note: contrast with Rust and Go, in which a task should not share mutable data with other tasks, and a failure kills the whole task without hope of recovery.
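
A minimal, contrived sketch of the problem (the transfer scenario and names are made up purely for illustration): a handler mutates shared state, throws part-way through, and the top-level catch "recovers" with that state silently inconsistent.

    #include <map>
    #include <stdexcept>
    #include <string>

    std::map<std::string, int> g_balances;       // state shared across requests

    // Hypothetical request handler: mutates shared state, then throws part-way through.
    void transfer(const std::string& from, const std::string& to, int amount) {
        g_balances[from] -= amount;              // first mutation succeeds
        if (amount > 1000)                       // validation performed too late
            throw std::runtime_error("limit exceeded");
        g_balances[to] += amount;                // never reached when we throw
    }

    int main() {
        g_balances = {{"alice", 5000}, {"bob", 0}};
        try {
            transfer("alice", "bob", 2000);
        } catch (const std::exception&) {
            // The top-level handler keeps serving, but alice is now 2000 short
            // and bob received nothing: the environment is corrupted.
        }
    }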

As a result, there are 3 recovery situations:

  • unrecoverable: the process memory is corrupted beyond repair
  • recoverable, manually: the process can be salvaged in the top-level handler at the cost of reinitializing a substantial part of its memory (caches, ...)
  • recoverable, automatically: okay, once we reach the top-level handler, the process is ready to be used again

A completely unrecoverable error is best addressed by crashing. Actually, in a number of cases (such as a pointer outside your process memory), the OS will help make it crash. Unfortunately, in some cases it won't (a dangling pointer may still point within your process memory); that's how memory corruption happens. Oops. Valgrind, ASan, Purify, etc. are tools designed to help you catch those unfortunate errors as early as possible; the debugger will assist (somewhat) with those that make it past that stage.

An error that can be recovered from, but requires manual cleanup, is annoying: you will forget to clean up in some rarely hit case. Thus it should be statically prevented. A simple transformation (moving caches inside the scope of the top-level handler) allows you to turn it into an automatically recoverable situation.

In the latter case, obviously, you can just catch, log, and resume your process, waiting for the next query. Your goal should be for this to be the only situation occurring in Production (cookie points if it does not even occur).
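
As a rough sketch of that top-level structure (the Request type and the next_request/handle functions below are hypothetical placeholders, just to make the example self-contained): keeping the caches inside the scope of the handler means that unwinding back to it reinitializes them for free.

    #include <exception>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Hypothetical plumbing.
    struct Request { std::string payload; };
    bool next_request(Request&);   // assumed: blocks until the next query arrives
    void handle(const Request&, std::unordered_map<std::string, std::string>& cache);

    void serve_forever() {
        for (;;) {
            // The cache lives inside the scope of the top-level handler: whatever
            // a failed request did to it is discarded when we unwind back here,
            // turning "recoverable manually" into "recoverable automatically".
            std::unordered_map<std::string, std::string> cache;
            try {
                Request req;
                while (next_request(req))
                    handle(req, cache);
            } catch (const std::exception& e) {
                std::cerr << "request failed: " << e.what() << '\n';   // log...
                // ...then fall through: the loop rebuilds the cache and resumes.
            }
        }
    }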


Easing Investigation

Note: I will take the opportunity to promote a project by Mozilla called rr, which could really, really help with investigation once it matures. Check the quick note at the end of this section.

Without surprise, in order to investigate you will need data. Preferably, as much as possible, and well ordered/labelled.

There are two (practiced) ways to obtain data:

  • continuous logging, so that when an exception occurs, you have as much context as possible
  • exception logging, so that upon an exception, you log as much as possible

Logging continuously implies a performance overhead and (when everything goes right) a flood of useless logs. On the other hand, exception logging implies having enough trust in the system's ability to perform some actions in case of exceptions (which, in the case of bad_alloc... oh well).

In general, I would advise a mix of both.

Continuous Logging

Each log should contain:

  • a timestamp (as precise as possible)
  • (possibly) the server name, the process ID and thread ID
  • (possibly) a query/session correlator
  • the filename, line number and function name of where this log came from
  • of course, a message, which should contain dynamic information (if you have a static message, you can probably enrich it with dynamic information)

What is worth logging?

At least I/O. All inputs, at a minimum; outputs as well can help spot the first deviation from expected behavior. I/O includes the inbound query and the corresponding response, as well as interactions with other servers, databases, various local caches, and timestamps (for time-related decisions), ...

The goal of such logging is to be able to reproduce the issue in a controlled environment (which can be set up thanks to all this information). As a bonus, it can be useful as a crude performance monitor, since it gives some checkpoints during processing (note: I am talking about monitoring and not profiling for a reason; this can allow you to raise alerts and spot roughly where time is spent, but you will need more advanced analysis to understand why).
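
A bare-bones sketch of what such a continuous-logging statement might look like (the LOG macro and its output format are purely illustrative; a real one would also stamp the server name and a query/session correlator, and write somewhere better than stderr):

    #include <chrono>
    #include <iostream>
    #include <sstream>
    #include <thread>

    // Illustrative macro: timestamp, thread ID, file:line, function, free-form message.
    #define LOG(msg)                                                              \
        do {                                                                      \
            auto now_us = std::chrono::duration_cast<std::chrono::microseconds>(  \
                std::chrono::system_clock::now().time_since_epoch()).count();     \
            std::ostringstream oss_;                                              \
            oss_ << now_us << " [" << std::this_thread::get_id() << "] "          \
                 << __FILE__ << ':' << __LINE__ << ' ' << __func__                \
                 << " - " << msg << '\n';                                         \
            std::cerr << oss_.str();                                              \
        } while (0)

    int main() {
        int user_id = 42;
        LOG("received query for user " << user_id);   // dynamic data in the message
    }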

Exception Logging

The other option is to enrich the exception. As an example of a crude exception: std::out_of_range yields the following reason (from what()): vector::_M_range_check, when thrown from libstdc++'s vector.

This is pretty much useless if, like me, vector is your container of choice and therefore there are about 3,640 locations in your code where this could have been thrown.

The basics, to get a useful exception, are:

  • a precise message: "access to index 32 in vector of size 4" is slightly more helpful, no?
  • a call stack: it requires platform-specific code to retrieve, but it can be captured automatically in your base exception's constructor, so go for it!

Note: once you have a call-stack in your exceptions, you will quickly find yourself addicted and wrapping lesser-abled 3rd party software into an adapter layer if only to translate their exceptions into yours; we all did it ;)
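
Here is a minimal sketch of such a base exception, assuming a glibc platform (backtrace() and backtrace_symbols() from <execinfo.h>; other platforms would substitute their own stack-capture call):

    #include <execinfo.h>   // glibc-specific; see note above

    #include <cstdlib>
    #include <sstream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Project base exception capturing the call stack at the throw site.
    class Exception : public std::runtime_error {
    public:
        explicit Exception(const std::string& what)
            : std::runtime_error(what)
        {
            void* frames[64];
            int n = backtrace(frames, 64);
            if (char** symbols = backtrace_symbols(frames, n)) {
                for (int i = 0; i < n; ++i)
                    stack_.emplace_back(symbols[i]);
                std::free(symbols);
            }
        }

        const std::vector<std::string>& stack() const { return stack_; }

    private:
        std::vector<std::string> stack_;   // one readable line per frame
    };

    // Usage: a precise message plus the automatically captured stack.
    void check_index(std::size_t index, std::size_t size) {
        if (index >= size) {
            std::ostringstream oss;
            oss << "access to index " << index << " in vector of size " << size;
            throw Exception(oss.str());
        }
    }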

On top of those basics, there is a very interesting use of RAII: attaching notes to the current exception during unwinding. A simple handler that retains a reference to a variable and checks, in its destructor, whether an exception is unwinding costs only a single if check in general, and does all the important logging when unwinding (but then, exception propagation is costly already, so...).

Finally, you can also enrich and rethrow in catch clauses, but this quickly litters the code with try/catch blocks so I advise using RAII instead.
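
A sketch of that RAII idea (using C++17's std::uncaught_exceptions(); pre-C++17 code would test std::uncaught_exception() instead, and a production version would capture references and format lazily to keep the happy path cheap):

    #include <exception>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Guard that remembers what the enclosing scope was doing and logs it from
    // its destructor, but only if that scope is being unwound by an exception.
    class UnwindNote {
    public:
        explicit UnwindNote(std::string note)
            : note_(std::move(note)), exceptions_on_entry_(std::uncaught_exceptions()) {}

        ~UnwindNote() {
            if (std::uncaught_exceptions() > exceptions_on_entry_)   // a single comparison
                std::cerr << "while: " << note_ << '\n';             // log only when unwinding
        }

    private:
        std::string note_;
        int exceptions_on_entry_;
    };

    void process_order(int order_id) {
        UnwindNote note("processing order " + std::to_string(order_id));
        // ... work that may throw; if it does, the note is logged during unwinding:
        throw std::runtime_error("boom");
    }

    int main() {
        try {
            process_order(7);
        } catch (const std::exception& e) {
            std::cerr << "caught: " << e.what() << '\n';
        }
    }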

Note: there is a reason that std exceptions do NOT allocate memory: it allows throwing them without the throw itself being preempted by a std::bad_alloc. I advise consciously choosing richer exceptions in general, accepting the potential of a std::bad_alloc being thrown when attempting to create an exception (which I have yet to see happen). You have to make your own choice.

And Delayed Logging?

The idea behind delayed logging is that instead of calling your log handler as usual, you defer logging all finer-grained traces and only emit them in case of an issue (i.e., an exception).

The idea, therefore, is to split logging:

  • important information is logged immediately
  • finer-grained information is written to a scratch pad, which can be flushed to the log in case of an exception

Of course, there are questions:

  • the scratch pad is (mostly) lost in case of a crash; you should be able to access it via your debugger if you get a memory dump, though it's not as pleasant
  • the scratch pad requires a policy: when to discard it? (end of the session? end of the transaction? ...), how much memory? (as much as it wants? bounded? ...)
  • what of the performance cost: even if the logs are not written to disk/network, it still costs to format them!

I have actually never used such a scratch pad; so far, all non-crasher bugs I have had were solved solely using I/O logging and rich exceptions. Still, should I implement one, I would recommend making it (see the sketch after this list):

  • transaction local: since I/O is logged, we should not need more insight than this
  • memory bounded: evicting older traces as we progress
  • log-level driven: just as regular logging, I would want to be able to only enable some logs to get into the scratch pad
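
A rough sketch along those lines (the class name, sizes and trace content are made up; a real scratch pad would also honour log levels, as mentioned above):

    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Transaction-local, memory-bounded scratch pad: traces are recorded cheaply,
    // flushed only on failure, and discarded when the transaction succeeds.
    class ScratchPad {
    public:
        explicit ScratchPad(std::size_t max_entries) : max_entries_(max_entries) {}

        void trace(std::string line) {
            if (entries_.size() == max_entries_)
                entries_.pop_front();               // evict the oldest trace
            entries_.push_back(std::move(line));
        }

        void flush(std::ostream& out) const {       // called from the exception path only
            for (const auto& e : entries_)
                out << e << '\n';
        }

        void discard() { entries_.clear(); }        // the transaction ended normally

    private:
        std::size_t max_entries_;
        std::deque<std::string> entries_;
    };

    void handle_transaction(ScratchPad& pad) {
        pad.trace("cache lookup for customer 42: miss");
        pad.trace("backend query returned 3 rows");
        throw std::runtime_error("inconsistent row count");   // simulate a failure
    }

    int main() {
        ScratchPad pad(128);                        // memory bounded
        try {
            handle_transaction(pad);
            pad.discard();                          // nothing interesting happened
        } catch (const std::exception& e) {
            std::cerr << "transaction failed: " << e.what() << '\n';
            pad.flush(std::cerr);                   // now the fine-grained traces earn their keep
        }
    }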

And Conditional / Probabilistic Logging?

Writing one trace out of every N is not really interesting; it's actually more confusing than anything. On the other hand, logging one transaction in depth out of every N can help!

The idea here is to reduce the amount of logs written in general, whilst still getting a chance to observe bug traces in detail in the wild. The reduction is generally driven by the logging infrastructure constraints (there is a cost to transferring and writing all those bytes) or by the performance of the software (formatting the logs slows software down).

The idea of probabilistic logging is to "flip a coin" at the start of each session/transaction to decide whether it'll be a fast one or a slow one :)

A similar idea (conditional logging) is to read a special debug field in the transaction that triggers full logging (at the cost of speed).
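
A small sketch of that per-transaction decision (the Request::debug field and the 1% probability are made up for illustration):

    #include <random>
    #include <string>

    struct Request {
        std::string payload;
        bool debug = false;        // hypothetical "please trace me in full" field
    };

    // Decide, once per transaction, whether this one gets in-depth logging.
    bool should_log_in_depth(const Request& req) {
        if (req.debug)                              // conditional: the client asked for it
            return true;

        static thread_local std::mt19937 rng{std::random_device{}()};
        std::bernoulli_distribution coin(0.01);     // probabilistic: ~1 transaction in 100
        return coin(rng);
    }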

A quick note on rr

With an overhead of only 20%, and that overhead applying only to CPU processing, it might actually be worth using rr systematically. If this is not acceptable, however, it could still be worthwhile to have 1 out of N servers launched under rr and used to catch hard-to-find bugs.

This is similar to A/B testing, but for debugging purposes, and can be driven either by a willing commitment of the client (flag in the transaction) or with a probabilistic approach.

Oh, and in the general case, when you are not hunting down anything, it can be easily deactivated altogether. No sense in paying those 20% then.


That's all folks

I could apologize for the lengthy read, but the truth is I probably just skimmed the topic. Error recovery is hard. I would appreciate comments and remarks to help improve this answer.

Upvotes: 3

utnapistim

Reputation: 27375

If the error is unrecoverable, then by definition there is nothing the application can do in a production environment to recover from it. In other words, the top-level exception handler is not really a solution: even if the application displays a friendly message like "access violation" or "possible memory corruption", that doesn't actually increase availability.

When the application crashes in a production environment, you should get as much information as possible for post-mortem analysis (your second solution).

That said, if you get unrecoverable errors in a production environment, the main problems are your product's QA process (it is lacking) and, well before that, the writing of unsafe/untested code.

When you finish investigating such a crash, you should not only fix the code, but also fix your development process so that such crashes are no longer possible (e.g., if the corruption is an uninitialized pointer write, go over your code base and initialize all pointers, and so on).

Upvotes: 0
