Martin Kersten
Martin Kersten

Reputation: 5513

Shorting the time between process crash and shooting server in the head?

I have a routine that crashes linux and force a reboot using a system function.

Now I have the problem that I need to crash linux when a certain process dies. Using a script starting the process and if the script ends restart the server is not appropriate since it takes some ms.

Another idea is spawning the shooting processes alongside and use polling of a counter and if the counter is not incremented reboot the server would be another idea.

This would result in an almost instant reaction.

Now the question is what would be a good timeframe. I have no idea how the scheduler of linux would guarantee a certain update of any such counter and what a good timeout would be.

Also I would like to hear some alternatives to this second process spawning. Is there a possibility to advice linux to run a certain routine in case of a crash of the given process or a listener meachanism for the even of problems with a given process?

Upvotes: 1

Views: 44

Answers (1)

viraptor
viraptor

Reputation: 34145

The timeout idea is already implemented in the kernel. You can register any application as a software watchdog, but you'll have to lower the default timeout. Have a look at http://linux.die.net/man/8/watchdog for some ideas. That application can also handle user-defined tests. Realistically unless you're trying to run kernel like linux-rt, having timeouts lower than 100ms can be dangerous on a system with heavy load - especially if the check needs to poll another application.

In cases of application crashes, you can handle them if your init supports notifications. For example both upstart and systemd can do that by monitoring files (make sure coredumps are created in the right place).

But in any case, I'd suggest rethinking the idea of millisecond-resolution restarts. Do you really need to kill the system in that time, or do you just need to isolate it? Just syncing the disks will take a few extra milliseconds and you probably don't want to miss that step. Instead of just killing the host, you could make sure the affected app isn't working (SIGABRT?) and kill all networking (flush iptables, change default to DROP).

Upvotes: 1

Related Questions