Reputation: 13713
I'm not sure exactly where (or even how exactly to ask) this question, so I'm hoping someone here can point me in the right direction.
I have a service that I'm building. That service has different objects in memory - each with it's own state. Whenever an object is created it loads the state from the database and hold it. When changes are made to the object they are also persistent to the database.
I would like to scale this service. I have looked at solutions such as akka.net (actor model) and they have a clustering solution. From what I've read, it synchronizes the state with something they call "gossip" where each node sends the state to the other node. I'm not sure that it really possible to convert my working application to akka.net at this point.
I'm wondering exactly how clusters keep state synced between different nodes (I get the gossip concept), what happens if I have machine A that receives a message and at the same time, machine B also receives a message - both change the same state of an object - that will make problems with data integrity between states. My only thought about this is to lock a shared resource, but that defeats the purpose of the cluster.
Keeping state in the database is also not an option since the database becomes a bottleneck and a single point of failure.
I can't seem to find any relevant reading materials online - but I'm also lacking the technical phrases I need to focus on.
In case it's relevant, I'm using .NET Core and c# for development.
Can anyone explain the concept of clustering, how it works and make sure nodes are at sync? or can point to the right direction?
Upvotes: 0
Views: 91
Reputation: 1052
You have a big problem. I think that the way you are thinking about the problem is a bigger problem. Let's go through some basics.
Clustering is used to solve big problems, much like the "eat an elephant" problem. You could to solve this problem design a unique bigger predator with a huge mouth. But history and paleontology has shown us that big predators are not easily sustained (they are expensive on the environment).
So to solve your problem, you could take a bigger stronger server.
Or, you could use clustering.
Clustering solves the "eat the elephant" problem in a very different way. Instead of sending a unique huge predator with a huge mouth to eat the elephant, it will use a concept of distributed and shared processing to eat it one bite at a time. When done properly, ants could eat the elephant. If there are enough of them and the circumstances are correct.
But notice in my example, ants are very small... A single ant will never carry the entire elephant. You could carry the entire elephant if all the ants worked together but then you run into concurrency and locking problems (you must coordinate the ants).
Ants have shown us a much better way to deal with this. They will take a piece of the elephant and deal with the problem in smaller chunks.
In your system you ask how you can sync data across nodes... My question would be why? If you are syncing data then you are mirroring and your problem becomes even bigger (you are cloning the elephant but can only eat the original).
The solution to your problem is to rethink the solution and see if you can break down the problem into smaller pieces.
In Akka and in the Actor pattern the best way to deal with problems is to use smaller "processes" (a single ant). While the process on its own is almost useless, when used in a large scale they can become very powerful. When the architecture is properly done you will notice that taking a flamethrower to ants will not defeat them... More ants will come, they will continue to work on the problem.
Copying and syncing data is not your solution, clustering it is. You must take your data and break it down to a point where you can give it to a single ant. If you can do this then you can use Akka. If this approach seems ludicrous then Akka is not for you.
But consider this... You obviously have concerns over your database backend - you don't want to increase IO and introduce a single point of failure. I would have to agree with you. But you need to rethink things. You could have database mirroring to remove the single point of failure but you are correct that this won't remove the bottleneck. So let's say that mirror removes the single point of failure... Now let's attack the bottleneck portion.
If you can split up your data into small enough chunks that ants can handle it then I would urge you to tell your ants to only report to the database when the data changes... You can read it once on initialization (you need a backend store, don't kid yourself, electricity can be quickly lost... it must be saved somewhere) but if you tell your ants to persist only changed data then you will remove all the queries from the equation which will drastically shift where the load is coming from. Once you only have updates, inserts and deletes to deal with... the whole landscape will be much simpler.
Clustering should be the solution for you, but only if you can take the concept of mirror away from your mind.
Cluster nodes can and will crash... But they can be respawned elsewhere to other nodes, so that you always have a quick system. Only when you deal with a crash or loss of a node/worker process/ant will you have to reload data...
Good luck... you have outlined a formidable problem that I have seen people with software engineering degrees fail at solving.
Upvotes: 2