SebastianStehle

Reputation: 2459

Failover strategies for stateful servers

In our project, we have a stateful server. The server runs a rule engine (Drools) and exposes its functionality through a REST service. It is a monitoring system, and it is critical that its uptime is more or less 100%. Therefore we need strategies for shutting down a server for maintenance, and for continuing to monitor an agent when one server is offline.

The first idea is to put a message queue or service bus in front of the Drools servers to hold messages that have not been processed yet, and to have a mechanism that backs up the state of the server to a database or other storage. This makes it possible to shut down a server for a few minutes to deploy a new version. But the question is what to do when a server goes offline unexpectedly. Are there any failover strategies for stateful servers? What is your experience? Any advice is welcome.

Upvotes: 0

Views: 756

Answers (1)

Steve

Reputation: 9480

There's no 'correct' way that I can think of. It rather depends on things like:

  1. sensitivity to changes over time windows.
  2. how quickly your application needs to be brought back up.
  3. impact if events are missed.
  4. impact if the events it is monitoring are not up to the second.
  5. how the application raises events back to the outside world.

Some ideas for enabling fail-over:

  1. Start from a clean slate. Examine the most serious impact of this before spending time developing anything else.
  2. Load a list of facts (today's transactions perhaps) from a database. Potentially replay in order. Possibly whilst using a pseudo clock. I'm aware of this being used for some pricing applications in the financial sector, although at the same time, I'm also aware that some of those systems can take a very long time to catch up due to the number of events that need to be replayed.
  3. Persist the stateful session periodically. The interval to be determined based on how far behind the DR application is permitted to be, and how long it takes to persist a session. This way, the DR application can retrieve the same session from the database. However, there will be a gap in events received based on the interval between persists. Of course, if the reason for failure is corruption of the session, then this doesn't work so well.
  4. Configure middleware to forward events to 2 queues, and subscribe primary and DR applications to those queues. This way, both monitors should be in sync and able to make decisions based on the last minute of activity. Note that if one leg is taken out for a period then it will need to catch up, and your middleware needs capacity to store multiple hours (however long an outage might be) worth of events on a queue. Also, your rules need to work off the timestamp set on the event when it was queued, rather than the current time. Otherwise, when a leg is brought back after an outage and works through its backlog, time-window rules would treat hours-old events as if they had just happened and could well raise spurious alerts.
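
To illustrate the point in option 4 about working off the event's own timestamp, here is a minimal sketch in plain Java, outside of Drools. The class `EventTimeWindow` and its method names are made up for this example, and it assumes events arrive in timestamp order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: counts failure events in a sliding window that is
// measured in event time (the timestamp carried by the event), not wall-clock time.
final class EventTimeWindow {
    private final Duration window;
    private final int threshold;
    private final Deque<Instant> failures = new ArrayDeque<>();

    EventTimeWindow(Duration window, int threshold) {
        this.window = window;
        this.threshold = threshold;
    }

    // Returns true when the number of failures inside the window ending at
    // this event's timestamp reaches the threshold.
    // Assumption: events are delivered in timestamp order.
    boolean onFailure(Instant eventTime) {
        failures.addLast(eventTime);
        Instant cutoff = eventTime.minus(window);
        // Evict failures that are older than the window, relative to the event.
        while (!failures.isEmpty() && failures.peekFirst().isBefore(cutoff)) {
            failures.removeFirst();
        }
        return failures.size() >= threshold;
    }
}
```

Because the window is anchored to `eventTime`, a leg that drains an hour of backlog in a few seconds reaches the same decisions it would have reached live. A window anchored to the current time would see the whole backlog as one burst.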

An additional point to consider when replaying events is that you probably don't want any alerts to be raised to the outside world until you have completed the replay. For instance you probably don't want 50 alert emails sent to say that ApplicationX is down, up, down, up, down, up, ...
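
One possible way to hold alerts back during a replay is a small gate in front of whatever sends them. This is only a sketch: `ReplayAlertGate`, its methods and the message format are invented for the example, and it assumes that only the last known status per application matters once the replay is done.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: swallows alerts while events are being replayed and
// emits at most one alert per source once the replay has finished.
final class ReplayAlertGate {
    private final Consumer<String> sink;                 // e.g. the email sender
    private final Map<String, String> latestBySource = new LinkedHashMap<>();
    private boolean replaying = true;

    ReplayAlertGate(Consumer<String> sink) {
        this.sink = sink;
    }

    void alert(String source, String status) {
        if (replaying) {
            // Overwrite earlier statuses: only the final state is of interest.
            latestBySource.put(source, status);
        } else {
            sink.accept(source + " is " + status);
        }
    }

    // Call once the backlog has been processed; flushes the final states.
    void replayFinished() {
        replaying = false;
        latestBySource.forEach((source, status) -> sink.accept(source + " is " + status));
        latestBySource.clear();
    }
}
```

With this in place, a replayed sequence of down, up, down for ApplicationX results in a single "ApplicationX is down" alert after the replay. Whether a final "up" state should be reported at all is a decision for your application.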

I'll assume that a monitoring application might be pushing alerts to the outside world in some form. If you have a hot-hot configuration as in 4, you also need to control your alerts. I would be tempted to deal with this by configuring each to push alerts to its own queue. Then middleware could forward alerts from the secondary monitor to a dead letter queue. Failover would be to reconfigure middleware so that primary alerts go to the dead letter queue and secondary alerts go to the alert channel. This mechanism could also be used to discard events raised during a replay recovery.
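
The routing idea can be sketched as follows. In practice this logic would live in the middleware configuration rather than in application code. `AlertRouter`, `Leg` and the two consumers are hypothetical names standing in for the per-monitor alert queues, the alert channel and the dead letter queue.

```java
import java.util.function.Consumer;

// Hypothetical sketch: both legs publish alerts, but only alerts from the
// active leg reach the alert channel; the rest go to a dead letter sink.
final class AlertRouter {
    enum Leg { PRIMARY, SECONDARY }

    private final Consumer<String> alertChannel;
    private final Consumer<String> deadLetter;
    private volatile Leg active = Leg.PRIMARY;

    AlertRouter(Consumer<String> alertChannel, Consumer<String> deadLetter) {
        this.alertChannel = alertChannel;
        this.deadLetter = deadLetter;
    }

    // Forward the alert depending on which leg raised it.
    void route(Leg from, String alert) {
        if (from == active) {
            alertChannel.accept(alert);
        } else {
            deadLetter.accept(alert);
        }
    }

    // Failover: swap which leg's alerts are delivered.
    void failOverTo(Leg leg) {
        active = leg;
    }
}
```

Failover is then a single switch, `failOverTo(Leg.SECONDARY)`, and the same switch can be used to discard alerts from a leg while it is still replaying.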

Given the complexity and potential mess that can arise from replaying events, for a monitoring application I would probably prefer starting from a clean slate, or going with persisted sessions. However this may well depend on what you are monitoring.

Upvotes: 1
