Distributed Windows Service

I have a class library that runs inside a Windows service. The library has long-running threads that poll email (which could be broken out into tasks), handle messages, and so on, and it works well.

This is part of a product that needs to scale out by adding nodes. Currently, I define which customers are handled by each node, so every customer depends on a single node.

The problem is that if a node goes down or needs maintenance, manual intervention is required and data is lost during the downtime. I'd like a solution that works like load-balanced web servers: if a node goes down, the application detects it and acts appropriately.

This is built on C# / .NET and MS SQL Server, and I would like to stick with those technologies.

I realize this may not be as straightforward as my question makes it seem, but I'm looking for any design patterns or best practices that could help me build out a solution.

Upvotes: 3

Answers (3)

Legowo Budianto

Reputation: 26

My approach would be to distribute the service across several computers and coordinate the instances with Paxos or a similar algorithm to handle leader election. When the service goes down on one node, an instance on another server can take over. In practice, I would use Apache ZooKeeper to coordinate the leader election.
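
To make the election idea concrete, here is a minimal in-memory sketch. It imitates the rule used by ZooKeeper's leader election recipe, where each candidate gets a sequence number and the live candidate with the lowest number leads. `ElectionRegistry` and the node names are hypothetical; this is not a ZooKeeper client, and Python is used only to keep the sketch short.

```python
import itertools


class ElectionRegistry:
    """In-memory stand-in for a coordination service (hypothetical)."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}  # sequence number -> node name

    def join(self, node):
        """Register a live node and return its sequence number."""
        seq = next(self._seq)
        self._members[seq] = node
        return seq

    def leave(self, seq):
        """Remove a node, as would happen when its session expires."""
        self._members.pop(seq, None)

    def leader(self):
        """The live node with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


registry = ElectionRegistry()
a = registry.join("node-a")
b = registry.join("node-b")
c = registry.join("node-c")
first_leader = registry.leader()   # node-a joined first, so it leads
registry.leave(a)                  # node-a goes down
second_leader = registry.leader()  # node-b takes over without manual steps
```

In a real deployment the registry would be the coordination service itself, and a node's departure would be detected by session expiry rather than an explicit `leave` call.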

Upvotes: 0

Tung

Reputation: 5444

1) Have each installed Windows service register itself in the database with a unique id.

2) While your service is alive, send a heartbeat. This heartbeat can be something as simple as an update to a DateTime field of when the service last checked in. You can update a field directly in the database or go through a web service.

3) Create a table that defines a set of tasks and the unique_id of the machine assigned to each task. Assignment can be first come, first served: a machine can pick up any task it chooses, and it gets exclusive rights to that task by registering itself in this table. I prefer this approach to a centralized controller because you never have to worry about tasks not running when the controller goes down.

4) Define a timeout value for the heartbeat. Each of your distributed services will check for tasks that have either not been picked up or whose owner has timed out. The heartbeat of a machine performing a task should not depend on how long the task takes. That is, if task A takes 5 minutes, machineA should still update its heartbeat during those 5 minutes so that machineB does not flag it as having gone down.

5) Depending on how complex your task is, you may need a status column that the worker updates.

Upvotes: 4

Toan Nguyen

Reputation: 11601

My design would be a central service that maintains and distributes jobs, plus worker services that actually process them. When there are jobs to be done, they are added to a queue on the central service, which notifies the worker services. Each worker then tries to get a job to work on. If a job is allocated to a worker, the worker updates the job's status depending on whether it completes the job or fails. With this design you can scale out to as many worker services as you wish, and one or two workers going down does not affect the rest: their jobs are still considered uncompleted, so other workers can pick them up and process them.
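
A minimal sketch of that flow, assuming an in-memory queue in place of the real central service. `JobQueue`, the job names, and the worker names are hypothetical, and Python is used only for brevity. A job that a worker reports as failed goes back on the queue, so another worker can pick it up.

```python
from collections import deque


class JobQueue:
    """Hypothetical central service: hands out jobs and tracks their status."""

    def __init__(self, jobs):
        self._pending = deque(jobs)
        self.status = {job: "queued" for job in jobs}

    def take(self, worker):
        """Give the next queued job to a worker, or None if the queue is empty."""
        if not self._pending:
            return None
        job = self._pending.popleft()
        self.status[job] = f"running on {worker}"
        return job

    def report(self, job, succeeded):
        """Record the outcome; a failed job is re-queued for another worker."""
        if succeeded:
            self.status[job] = "done"
        else:
            self.status[job] = "queued"
            self._pending.append(job)


queue = JobQueue(["job-1", "job-2"])

job = queue.take("worker-1")       # worker-1 gets job-1
queue.report(job, succeeded=False)  # worker-1 fails, job-1 is re-queued

job = queue.take("worker-2")       # worker-2 gets job-2
queue.report(job, succeeded=True)

retried = queue.take("worker-2")   # worker-2 picks up the re-queued job-1
queue.report(retried, succeeded=True)
```

This sketch only covers a worker that reports failure. A worker that dies silently never calls `report`, so the central service would also need some timeout on running jobs to put them back on the queue.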

Upvotes: 0
