Reputation: 53

Why Kernel Can't Handle Crash Gracefully

For user mode application, incorrect page access doesn't create a lot of trouble other than the application crash and the application crash can be done gracefully by exception handling. Why can't we do the same for kernel crash. So when a kernel module tries to access some invalid address, there is a page fault and the kernel crash. Why it can't be handled gracefully like unloading the faulty module.

More specifically I'm interested to know if it is completely impossible or possible. I am not inclined to know the difficulties it might pose in using the system. I understand a driver crash will result unusable device and I'm okay with that. The only thing is whether it is possible to gracefully unload a faulty driver.

Upvotes: 0

Answers (3)

Alexandre Belloni

Reputation: 2304

The kernel modules and the kernel itself share the same address space. There is simply no protection if a modules starts to misbehave and overwrite memory from another subsystem. So, when a driver crashes, it may or may not stay local to that driver. If you are lucky, you still have a somewhat functional kernel and can continue to work. That doesn't happen with userspace because the address space for each process is separate and so it is possible to catch erroneous memory access and stop the process (this is a SEGFAULT).

Upvotes: 0

Nemanja Boric

Reputation: 22157

As other answer explains very well why it's not feasible to recover from the kernel crashes, I'll try to tell something else.

There is a lot of research in this area, most notably from prof. Andy Tanenbaum with his MINIX. While the kernel crash is still fatal for the MINIX, MINIX kernel is very simple (micro-kernel) narrowing the space for bugs and inside it most other stuff (including drivers) is running as a user-mode process. So, in case of network driver failure, as they are running in the separate address space, all kernel needs to do is to attempt to restart the driver.

Of course, there are areas where you can't recover (or still can't recover), like in case of the file system crash (see the recent discussion here).

There are several good papers on this topic such as http://pages.cs.wisc.edu/~swami/papers/thesis.pdf and I would highly recommend watching Tanenbaum's videos such this one (title is "MINIX 3: A Reliable and Secure Operating System" in case it ever goes offline).

I think this addresses your comment:

We should be able to unload the faulty module. Why can't we? That is my question. Is it a design choice for security or its not possible at all. If it is a design choice, what factors forced us to make that choice

You could live without screen if graphics driver module crashes. However, we can't unload the faulty module and continue because if it crashed and it runs in the same address space as kernel, you don't know if it poisoned the kernel memory - security is the prime factor here.

Upvotes: 5

Brian Malehorn

Reputation: 2685

That's kind of like saying "if you wrap all your Java code in a try/catch block, you've eliminated all bugs!"

There are a number of "errors" that are caught, e.g. kalloc returns NULL if it's out of memory, USB code returns errors if there's no USB, etc. But there's no try/catch for the entire operating system, because not all bugs can be repaired.

As pointed out, what happens if your filesystem module crashes? Keep running without files? How about your ethernet driver? Now your box is cut off from the internet and you can't even ssh into it anymore, but doesn't even have the decency to reboot.

So even though it may be possible for the kernel to not "crash" when a module crashes, the state of the kernel could be arbitrarily broken. The kernel could stay alive without a screen, filesystem or internet connection, but what kind of existence is that?

Upvotes: 2

Why Kernel Can&#39;t Handle Crash Gracefully

Answers (3)

Related Questions

Why Kernel Can't Handle Crash Gracefully