zx_wing

Reputation: 1966

Design patterns to recover system from intermediate state after crashing

My system is made up of several components. A request typically passes through all of them, and each component uses its own DB table to track system state.

For example, when a request arrives, component A creates a resource R as follows:

1. Create a DB row for R, marking its state as "Creating".
2. The application layer does the real work, which may take anywhere from a couple of minutes to hours.
3. Update the DB row for R, marking its state as "Ready".

Every component does something similar.

The problem is that the system may crash at any time and be left in an intermediate state. For example, resource R may remain in "Creating" after a system failure.
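To make the failure mode concrete, here is a minimal sketch of the flow, assuming an in-memory SQLite table. The table name `resources`, the functions `create_resource` and `crashing_work`, and the simulated crash are all hypothetical stand-ins, not my real code.

```python
import sqlite3

def create_resource(conn, resource_id, do_real_work):
    # Step 1: record the resource as "Creating" and commit right away.
    conn.execute(
        "INSERT INTO resources (id, state) VALUES (?, ?)",
        (resource_id, "Creating"),
    )
    conn.commit()
    # Step 2: the long-running work, outside any DB transaction.
    do_real_work(resource_id)
    # Step 3: mark the resource as "Ready".
    conn.execute(
        "UPDATE resources SET state = ? WHERE id = ?", ("Ready", resource_id)
    )
    conn.commit()

def crashing_work(resource_id):
    # Hypothetical stand-in for a process crash in the middle of step 2.
    raise RuntimeError("simulated crash")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id TEXT PRIMARY KEY, state TEXT)")

try:
    create_resource(conn, "R", crashing_work)
except RuntimeError:
    pass  # the process would be gone here; step 3 never runs

# The row for R is left behind in the intermediate state.
stuck = conn.execute(
    "SELECT id FROM resources WHERE state = 'Creating'"
).fetchall()
print(stuck)
```

After the simulated crash, the query still finds R in "Creating", and the row alone does not say whether the real work finished.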

My question is: for a system like this, which cannot use a transaction to cover all steps (either because the transaction would be too long or because the system is distributed), what are the design patterns or best practices for recovering the system?

I think this case is very common in ERP systems or any system that uses SOA.

UPDATE: The request can be resent, but the resource R that is stuck in the intermediate state 'Creating' may already have been created in the real world. This is somewhat like a distributed system, where a component crash leaves the whole system's state inconsistent. What are some practices for designing a system that can resync itself after a failure?

Upvotes: 1

Views: 433

Answers (1)

ali köksal

Reputation: 1227

You can route your requests as JMS messages between the components of your system. That way you delegate message persistence and guaranteed delivery to the JMS implementation (e.g. ActiveMQ). If a component crashes before acknowledging a message, the message will be redelivered to it.
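As a rough illustration of the redelivery behaviour, here is a toy in-memory queue in Python. It is not the JMS API and not how ActiveMQ is implemented; the class `RedeliveringQueue`, its methods, and the `max_attempts` limit are hypothetical names chosen for this sketch. The only point it makes is that a message is removed after the handler succeeds, never before.

```python
from collections import deque

class RedeliveringQueue:
    """Toy in-memory stand-in for a broker with at-least-once delivery."""

    def __init__(self):
        self._pending = deque()

    def send(self, message):
        self._pending.append(message)

    def deliver_all(self, handler, max_attempts=5):
        """Deliver each message until the handler returns without raising."""
        attempts = {}
        while self._pending:
            message = self._pending[0]
            attempts[message] = attempts.get(message, 0) + 1
            try:
                handler(message)
            except Exception:
                if attempts[message] >= max_attempts:
                    raise
                continue  # not acknowledged: stays at the head, gets redelivered
            self._pending.popleft()  # acknowledged only after successful handling
        return attempts

seen = []

def flaky_handler(message):
    # Hypothetical component that "crashes" the first time it sees a message.
    seen.append(message)
    if seen.count(message) == 1:
        raise RuntimeError("simulated crash before acknowledgement")

queue = RedeliveringQueue()
queue.send("create R")
attempts = queue.deliver_all(flaky_handler)
print(attempts)
```

The handler sees the same message twice, which is why the receiving component has to tolerate duplicates.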

The following section was added in response to the OP's comment.

UPDATE: The request can be resent, but the resource R that is stuck in the intermediate state 'Creating' may already have been created in the real world. This is somewhat like a distributed system, where a component crash leaves the whole system's state inconsistent. What are some practices for designing a system that can resync itself after a failure?

This depends heavily on the nature of the system in question and its components. Here is one way to build a failure-resistant system.

1) Messages between components should not be lost, and their delivery should be guaranteed. This can be accomplished with a dedicated message queue.

2) Each operation should be idempotent, i.e. it can be invoked more than once without any additional side effects. That way, if an error occurs during message processing, the message queue will send the message again and the component will handle it again, e.g. by checking the operation's completion status against its local state and performing only the steps needed to complete the operation, skipping those already completed.
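The two points above could be sketched roughly as follows, again in Python with an in-memory SQLite table. The function `handle_create` and the callbacks `resource_exists` and `create_resource` are hypothetical; in particular, this assumes the component has some way to ask the real world whether R already exists, which will depend on your system.

```python
import sqlite3

def handle_create(conn, resource_id, resource_exists, create_resource):
    """Idempotent handler: safe to call again after a crash or redelivery."""
    row = conn.execute(
        "SELECT state FROM resources WHERE id = ?", (resource_id,)
    ).fetchone()
    if row is not None and row[0] == "Ready":
        return "skipped"  # already completed by an earlier delivery
    if row is None:
        conn.execute(
            "INSERT INTO resources (id, state) VALUES (?, 'Creating')",
            (resource_id,),
        )
        conn.commit()
    # A row stuck in "Creating" does not say whether the real work finished,
    # so check the real world before redoing it (assumed to be possible).
    if not resource_exists(resource_id):
        create_resource(resource_id)
    conn.execute(
        "UPDATE resources SET state = 'Ready' WHERE id = ?", (resource_id,)
    )
    conn.commit()
    return "completed"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id TEXT PRIMARY KEY, state TEXT)")
# Aftermath of a crash: row stuck in "Creating", resource already built.
conn.execute("INSERT INTO resources VALUES ('R', 'Creating')")
conn.commit()
real_world = {"R"}
created = []

def create(resource_id):
    created.append(resource_id)
    real_world.add(resource_id)

first = handle_create(conn, "R", real_world.__contains__, create)
second = handle_create(conn, "R", real_world.__contains__, create)
print(first, second, created)
```

The first redelivery only finishes the bookkeeping, since R already exists, and the second is a no-op. If the real work cannot be checked for completion, it has to be made safe to repeat instead.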

For a more complete answer and system design guidance, please take a look at WS-BPEL.

Upvotes: 2
