Reputation: 1567
I have a database table that contains some records to be processed. The table has a flag column that represents the following status values. 1 - ready to be processed, 2- successfully processed, 3- processing failed.
The .net code (repeating process - console/service) will grab a list of records that are ready to be processed, and loop through them and attempt to process them (Not very lengthy), update status based on success or failure.
To have better performance, I want to enable multithreading for this process. I'm thinking to spawn say 6 threads, each threads grabbing a subset.
Obviously I want to avoid having different threads process the same records. I dont want to have a "Being processed" flag in the database to handle the case where the thread crashes leaving the record hanging.
The only way I see doing this is to grab the complete list of available records and assigning a group (maybe ids) to each thread. If an individual thread fails, its unprocessed records will be picked up next time the process runs.
Is there any other alternatives to dividing the groups prior to assigning them to threads?
Upvotes: 7
Views: 3104
Reputation:
Here is one approach that does not rely/use an additional database column (but see #4) or mandate an in-process queue. The premise this approach is to "shard" records across workers based on some consistent value, much like a distributed cache.
Here are my assumptions:
Assign each thread a unique bucket
value from [0, thread_count)
. If a thread dies/is restarted it will take the same bucket as that which it vacated.
Then, each time a thread needs a new record is needed it will fetch from the database:
SELECT *
FROM record
WHERE state = 'unprocessed'
AND (id % $thread_count) = $bucket
ORDER BY date
There could of course be other assumptions made about reading the "this threads tasks" in batch and storing them locally. A local queue, however, would be per thread (and thus re-loaded upon a new thread startup) and thus it would only deal with records associated for the given bucket
.
When the thread is finished processing a record should mark the record as processed using the appropriate isolation level and/or optimistic concurrency and proceed to the next record.
Upvotes: 0
Reputation: 150138
The most straightforward way to implement this requirement is to use the Task Parallel Library's
Parallel.ForEach (or Parallel.For).
Allow it to manage individual worker threads.
From experience, I would recommend the following:
Alternatively
Consider using a transactional queue (MSMQ or Rabbit MQ come to mind). They are optimized for this very problem.
That would be my clear choice, having done both at massive scale.
Optimizing
If it takes a non-trivial amount of time to retrieve data from the database, you can consider a Producer/Consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern allows one thread (producer) to populate a queue with DB records to be processed, and multiple other threads (consumers) to process items off of that queue.
A New Alternative
Given that several processing steps touch the record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.
Upvotes: 6
Reputation: 896
I remember doing something like what you described...A thread checks from time to time if there is something new in database that needs to be processed. It will load only the new ids, so if at time x last id read is 1000, at x+1 will read from id 1001.
Everything it reads goes into a thread safe Queue. When items are added to this queue, you notify the working threads (maybe use autoreset events, or spawn threads here). each thread will read from this thread safe queue one item at a time, until the queue is emptied.
You should not assign before the work foreach thread (unless you know that foreach file the process takes the same amount of time). if a thread finishes the work, then it should take the load from the other ones left. using this thread safe queue, you make sure of this.
Upvotes: 2