Reputation: 77329
I have tens of thousands of pieces of data/jobs that need frequent, recurring processing. The jobs to be processed are stored in a SQL Server 2012 (Web Edition) database that is frequently updated with new jobs and may also have jobs deleted from it.
Sample:
Id | WorkItem
1 | Copy X to Y
2 | Ping stackoverflow.com
3 | Verify backupset
4 | Send an email
My goal is to distribute the job processing across multiple nodes, both for performance reasons and to ensure jobs get processed even if a node fails.
Two considerations:
Because the entire system needs to be resilient to node failure, I can't really "assign" work to specific nodes, since I don't know which nodes are up and which aren't. My initial thought to work around this was to have each node "fetch" one or more jobs from the database, process them, and then write the results back to the SQL database. However, that approach raises some issues of its own.
What is the most efficient way to solve this problem? Thank you!
Upvotes: 0
Views: 226
Reputation: 2107
You definitely need a queue mechanism, and SQL Server has one built in: Service Broker.
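As a rough illustration of what that involves (all object names here — JobMessage, JobQueue, JobService, MyJobsDb — are placeholders I've made up, and the XML payload format is just an assumption), setting up and using a Service Broker queue looks roughly like this:

```sql
-- Enable Service Broker for the database (placeholder database name).
ALTER DATABASE MyJobsDb SET ENABLE_BROKER;

-- Define the message type, contract, queue, and service.
CREATE MESSAGE TYPE JobMessage VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT JobContract (JobMessage SENT BY INITIATOR);
CREATE QUEUE dbo.JobQueue;
CREATE SERVICE JobService ON QUEUE dbo.JobQueue (JobContract);

-- Enqueue a job.
DECLARE @dialog UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @dialog
    FROM SERVICE JobService TO SERVICE 'JobService'
    ON CONTRACT JobContract WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @dialog
    MESSAGE TYPE JobMessage (N'<job>Ping stackoverflow.com</job>');

-- A worker node dequeues a job; RECEIVE removes the message from the
-- queue, so two nodes never pick up the same one.
DECLARE @body XML;
WAITFOR (
    RECEIVE TOP (1) @body = CAST(message_body AS XML)
    FROM dbo.JobQueue
), TIMEOUT 5000;
```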
Upvotes: 0
Reputation: 70369
You might need to experiment to find the optimal configuration of SQL Server (I'm not an expert on that)... but the basic scheme goes like this:
Use five columns to manage job processing (a table sketch follows this list):
The first, called DONEUNTIL, acts as a failsafe in case your processing node dies... it is basically an "estimated processing completion point in time".
The second, called NODEID, contains the ID of the node processing that job.
The third, called DONE, gets set to 1 once the job is finished.
The fourth, called CREATED, contains the timestamp when the job was put into the table.
The fifth, called JOBID, is the primary key.
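A minimal sketch of such a table (the table name dbo.Jobs and the column types are my assumptions; WorkItem mirrors the sample in the question):

```sql
CREATE TABLE dbo.Jobs (
    JOBID     INT IDENTITY(1,1) PRIMARY KEY,
    WorkItem  NVARCHAR(400) NOT NULL,
    CREATED   DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    NODEID    INT NULL,        -- node currently processing; NULL = unclaimed
    DONEUNTIL DATETIME2 NULL,  -- lease-expiry failsafe
    DONE      BIT NOT NULL DEFAULT 0
);
```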
Have every node periodically clean up all jobs where DONEUNTIL has passed and DONE != 1, by setting DONEUNTIL and NODEID back to NULL.
When a node is ready to take the next job, it selects the JOBID with the oldest CREATED that has NODEID = NULL and DONE != 1.
It then updates DONEUNTIL and NODEID appropriately before it starts processing.
After finishing processing, it sets DONE = 1. (A T-SQL sketch of this whole loop follows these steps.)
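Here is one way that loop could look in T-SQL. The node ID, the 5-minute lease length, and the UPDLOCK/READPAST locking hints are my additions, not part of the scheme above; the hints make concurrent nodes skip rows that another node is claiming at the same instant, so no two nodes grab the same job:

```sql
DECLARE @NodeId INT = 42;  -- this node's identifier (assumed scheme)
DECLARE @Claimed TABLE (JOBID INT, WorkItem NVARCHAR(400));

-- Cleanup: release jobs whose lease has expired so they can be retried.
UPDATE dbo.Jobs
SET NODEID = NULL, DONEUNTIL = NULL
WHERE DONE = 0 AND NODEID IS NOT NULL AND DONEUNTIL < SYSUTCDATETIME();

-- Claim the oldest unclaimed job in one atomic statement.
WITH NextJob AS (
    SELECT TOP (1) JOBID, WorkItem, NODEID, DONEUNTIL
    FROM dbo.Jobs WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE DONE = 0 AND NODEID IS NULL
    ORDER BY CREATED
)
UPDATE NextJob
SET NODEID    = @NodeId,
    DONEUNTIL = DATEADD(MINUTE, 5, SYSUTCDATETIME())  -- assumed lease length
OUTPUT inserted.JOBID, inserted.WorkItem INTO @Claimed;

-- ... process the claimed work item here ...

-- Mark the job finished.
UPDATE j SET DONE = 1
FROM dbo.Jobs AS j
JOIN @Claimed AS c ON c.JOBID = j.JOBID;
```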
You could use variations of the above, like adding job priorities, retry counters, etc., or putting the "job-management information" into a separate table.
This scheme works really well. If you have lots of jobs, it might make sense to move jobs with DONE = 1 into an archive table (see the sketch below)... this way your job table only contains active jobs (either waiting to be processed or being processed), which should keep it running smoothly.
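That archiving step can be a single atomic statement. Here dbo.JobsArchive is an assumed table with the same columns as dbo.Jobs, its JOBID being a plain INT rather than an IDENTITY so the original values can be copied over:

```sql
-- Move finished jobs out of the hot table in one statement.
DELETE FROM dbo.Jobs
OUTPUT deleted.JOBID, deleted.WorkItem, deleted.CREATED,
       deleted.NODEID, deleted.DONEUNTIL, deleted.DONE
INTO dbo.JobsArchive (JOBID, WorkItem, CREATED, NODEID, DONEUNTIL, DONE)
WHERE DONE = 1;
```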
Upvotes: 1