Sam

Reputation: 307

Best practices for recovering from a segmentation fault

I am working on a multithreaded process written in C++, and am considering modifying SIGSEGV handling using google-coredumper to keep the process alive when a segmentation fault occurs.

However, this use of google-coredumper seems rife with opportunities to get stuck in an infinite loop of core dumps unless I somehow reinitialize the thread and the object that may have caused the core dump.

What best practices should I keep in mind when trying to keep a process alive through a core dump? What other 'gotchas' should I be aware of?

Thanks!

Upvotes: 15

Views: 12566

Answers (6)

parapura rajkumar

Reputation: 24403

The best practice is to fix the original issue causing the core dump, recompile, and then relaunch the application.

To catch these errors before deploying in the wild, do plenty of peer review and write lots of tests.
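As a concrete aid here, AddressSanitizer catches many of these memory bugs during testing, long before they turn into a SIGSEGV in production. A minimal sketch, assuming gcc or clang with -fsanitize=address available (the file name is illustrative):

/* asan_demo.c -- compile and run with:
 *   gcc -g -fsanitize=address asan_demo.c -o asan_demo && ./asan_demo
 * AddressSanitizer aborts with a detailed report at the bad write,
 * instead of letting the process limp on toward a later SIGSEGV. */
#include <stdlib.h>

int main(void)
{
    int *buf = malloc(4 * sizeof *buf);
    buf[4] = 42;   /* heap-buffer-overflow: one element past the end */
    free(buf);
    return 0;
}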

Upvotes: 15

mycal

Reputation: 71

Steve's answer is actually a very useful formula. I've used something similar in a piece of complicated embedded software where there was at least one SIGSEGV error in the code that we could not track down by ship time. As long as you can reset your code to have no ill effects (no memory or resource leaks), and the error is not something that causes an endless loop, it can be a lifesaver, even though it's better to fix the bug. FYI, in our case it was single-threaded.

But what is left out is that once you recover from the first fault in your signal handler, the handler will not work again unless you unblock the signal. Here is a chunk of code to do that:

sigset_t signal_set;
...
setjmp(buf);
/* Entering the handler added SIGSEGV to the blocked set, and longjmp()
   bypasses the normal handler return that would have unblocked it,
   so unblock it explicitly or the handler will only ever fire once. */
sigemptyset(&signal_set);
sigaddset(&signal_set, SIGSEGV);
sigprocmask(SIG_UNBLOCK, &signal_set, NULL);
// Initialize all variables...

Be sure to free up your memory, sockets, and other resources, or you will leak them each time this happens.
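Putting those pieces together, a minimal self-contained sketch of the recovery pattern (POSIX signals assumed; the handler name and the deliberate fault are illustrative only):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static jmp_buf buf;

static void handler(int s)
{
    (void)s;
    longjmp(buf, 1);   /* skips the normal return that would unblock SIGSEGV */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigaction(SIGSEGV, &sa, NULL);

    if (setjmp(buf))
    {
        /* We land here after a fault: unblock SIGSEGV so the handler
           can fire again, then reset any state the fault left behind. */
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGSEGV);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
        printf("recovered; state reset\n");
    }
    else
    {
        int *p = NULL;
        *p = 1;        /* deliberate fault for the demo */
    }
    return 0;
}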

Upvotes: 7

user1735527

Reputation: 573

It is actually possible in C. You can achieve it in a somewhat roundabout way:

1) Override the signal handler.

2) Use setjmp() and longjmp() to mark the place to jump back to, and to actually jump back there.

Check out this code I wrote (the idea is taken from "Expert C Programming: Deep C Secrets" by Peter van der Linden):

#include <signal.h>
#include <stdio.h>
#include <setjmp.h>

/* Global jmp_buf shared by main() and the signal handler. */
jmp_buf buf;

void magic_handler(int s)
{
    switch (s)
    {
        case SIGSEGV:
            /* Note: printf() is not async-signal-safe; fine for a demo. */
            printf("\nSegmentation fault signal caught! Attempting recovery..");
            longjmp(buf, 1);
            break;
    }

    printf("\nAfter switch. Won't be reached");
}

int main(void)
{
    int *p = NULL;

    signal(SIGSEGV, magic_handler);

    if (!setjmp(buf))
    {
        /* Dereferencing a null pointer causes a segmentation fault,
           which is now handled by magic_handler. */
        *p = 0xdead;
    }
    else
    {
        printf("\nSuccessfully recovered! Welcome back in main!!\n\n");
    }

    return 0;
}
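One caveat with the code above: POSIX leaves the signal mask unspecified after a longjmp() out of a signal handler, so on some systems SIGSEGV stays blocked afterwards (the problem mycal's answer works around). The sigsetjmp()/siglongjmp() pair saves and restores the mask explicitly. A minimal sketch of the same demo using them, assuming a POSIX system:

#include <signal.h>
#include <stdio.h>
#include <setjmp.h>

static sigjmp_buf buf;

static void handler(int s)
{
    (void)s;
    siglongjmp(buf, 1);   /* also restores the signal mask saved by sigsetjmp() */
}

int main(void)
{
    struct sigaction sa = { .sa_handler = handler };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(buf, 1) == 0)   /* savesigs = 1: remember the current mask */
    {
        int *p = NULL;
        *p = 0xdead;              /* deliberate fault */
    }
    else
    {
        printf("Recovered; SIGSEGV is unblocked again.\n");
    }
    return 0;
}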

Upvotes: 47

R. Martinho Fernandes

Reputation: 234504

If a segmentation fault occurs, you're better off just ditching the process. How can you know that any of your process's memory is usable after this? If something in your program is messing with memory it shouldn't, why do you believe it didn't mess with some other part of memory that your process actually can access without segfaulting?

I think that doing this will mostly benefit attackers.

Upvotes: 4

thiton

Reputation: 36049

My experience with segmentation faults is that it's very hard to catch them portably, and to do it portably in a multithreaded context is next to impossible.

This is for good reason: Do you really expect the memory (which your threads share) to be intact after a SIGSEGV? After all, you've just proven that some addressing is broken, so the assumption that the rest of the memory space is clean is pretty optimistic.

Think about a different concurrency model, e.g. one based on processes. Processes don't share their memory, or share only a well-defined part of it (shared memory), and one process can reasonably carry on after another has died. When you have a critical part of the program (e.g. the core temperature control), putting it in a separate process protects it from memory corruption by, and segmentation faults in, the other processes.
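For illustration, a minimal sketch of that model, assuming POSIX (the names are hypothetical): a supervisor process forks a worker and restarts it if the worker dies on SIGSEGV, while the supervisor's own memory stays untouched.

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical critical task; it runs in its own process, so a crash
   cannot corrupt the supervisor's memory. */
static void critical_task(void)
{
    int *p = NULL;
    *p = 1;   /* simulate the untracked bug: the worker segfaults */
}

int main(void)
{
    for (int restarts = 0; restarts < 3; restarts++)
    {
        pid_t pid = fork();
        if (pid == 0)            /* child: do the real work */
        {
            critical_task();
            _exit(0);
        }

        int status;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
            fprintf(stderr, "worker crashed with SIGSEGV; restarting\n");
        else
            break;               /* clean exit: we're done */
    }
    return 0;
}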

Upvotes: 5

Victor Sorokin

Reputation: 12006

From the description of coredumper, its purpose seems to be not what you intend, but simply to allow taking snapshots of a process's memory.

Personally, I wouldn't keep a process around after it has triggered a core dump -- there are just so many ways it could be broken -- and would instead employ some form of persistence to allow data recovery after the process is restarted.

And yes, as parapura has suggested, better yet: find out what is causing the SIGSEGV and fix it.

Upvotes: 1
