Reputation: 1286
For the next generation of one of our products, I have been asked to design a system that has both failover capability (i.e. there are several nodes, and if one of them crashes there is minimal or no data loss) and load balancing (so each node only handles part of the data). What I can't quite grok is how I can do both. Suppose a node has all the data but only processes an agreed subset. It changes element 8, say. Now all the other nodes have the wrong element 8. So I need to sync - tell all the other nodes that element 8 changed - to maintain integrity. But surely that just makes a mockery of the load balancing?!
Upvotes: 1
Views: 255
Reputation: 21976
The short answer is, it depends very much on your application architecture.
It sounds like you are approaching this with a common design anti-pattern: trying to solve for scale-out processing and disaster recovery at the same time, in the same layer. If each node only handles part of the data, then it can't be a failover for the other nodes. A lot of people fall into this trap, since both scale-out and DR can be implemented using a form of federation ... but don't confuse the mechanism with the objective. I would respectfully submit that you need to think about this problem a little differently.
The way to approach this problem is in two entirely separate layers:
Layer 1 -- app. Devise a high-level design for your app as if there were no DR requirement. Ignore the fact that there may be another instance of this app elsewhere that will be used for DR. Focus on the functional & performance aspects of your app: what the distinct subsystems should be, and whether any of them need to scale out for workload reasons. This app as a whole handles 100% of the data; decide whether a scale-out / federation approach is needed within the app itself. That decision has nothing to do with the DR requirement.
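To make that concrete, here is a minimal sketch (the node names and the simple hash-based scheme are just assumptions for illustration) of how an app node can decide which subset of the data it processes, purely as a workload decision inside the app layer:

```python
import hashlib

# Illustrative only: a simple hash-based partitioner. Each app node knows the
# full node list and its own name; it processes a key only if that key hashes
# to its partition. This is the "federation for workload" decision inside the
# app layer -- it has nothing to do with DR.

NODES = ["node-a", "node-b", "node-c"]   # hypothetical node names

def owner_of(key: str) -> str:
    """Map a data key to the node responsible for processing it."""
    digest = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def should_process(key: str, my_name: str) -> bool:
    """True if this node owns the key; other nodes simply skip it."""
    return owner_of(key) == my_name
```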
Layer 2 -- DR. Now think of your app as a black box. How many instances of the black box will you need to meet your availability requirements, and how will you maintain the required degree of synchronization between those instances? What are the performance requirements for failover & recovery (time to availability, allowable data loss if any, and how long before you need the next failover environment up & running)?
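One way to keep those answers from staying vague (just a sketch; the numbers and field names are made up for the example) is to write the DR targets down as data, so the rest of the design can be checked against them:

```python
# Illustrative sketch: record the DR targets explicitly so design decisions
# can be tested against them later.
DR_TARGETS = {
    "standby_instances": 1,   # black-box copies besides the primary
    "rto_seconds": 120,       # time to availability after a failure
    "rpo_seconds": 5,         # allowable data loss, expressed as replication lag
}

def replication_lag_ok(measured_lag_seconds: float) -> bool:
    """True if the observed replication lag still meets the agreed RPO."""
    return measured_lag_seconds <= DR_TARGETS["rpo_seconds"]
```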
Back to Layer 1 -- choose an implementation approach for your high-level design that uses the recovery approach and tools you identified in Layer 2. For example, if you will use a master-slave DB approach for data synchronization among DR nodes, store everything you want to preserve across a failover in the DB layer, not in app-node-local files or memory. These choices depend on the DR framework you choose.
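As a sketch of that idea (SQLite stands in here for whatever replicated DBMS you actually pick), session state goes through the database layer instead of a node-local dictionary, so a failover instance pointed at the replica sees the same state:

```python
import json
import sqlite3

# Sketch: keep transient state (sessions here) in the DB layer instead of
# node-local memory, so a DR instance can pick it up from the replica.
conn = sqlite3.connect("app_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)")

def save_session(session_id: str, data: dict) -> None:
    """Persist session state through the DB layer, not in local memory."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions (id, data) VALUES (?, ?)",
        (session_id, json.dumps(data)),
    )
    conn.commit()

def load_session(session_id: str):
    """Return the stored session dict, or None if it does not exist."""
    row = conn.execute(
        "SELECT data FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```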
The design of the app layer and the DR layer are related, but if you pick the right tools & approach, they don't have to be strongly coupled. E.g. in Amazon Web Services, you can use IP load balancing to forward requests to the failover app instance, and if you store all relevant data (including sessions and other transient things) in a database and use the DBMS's native replication capability, it's pretty simple.
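For example (again only a sketch), each app instance can expose a trivial health endpoint; the load balancer probes it and sends traffic only to instances that answer, which is about all the routing awareness the app itself needs:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch: a minimal health-check endpoint for the load balancer to probe.
# Anything deeper (checking DB connectivity, replication status, etc.) is an
# implementation choice for your environment.

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```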
Bottom line: design the app for its own functional and scale-out needs, design DR around the app as a black box, and choose tools that keep the coupling between the two layers loose.
Good luck
Upvotes: 1