Reputation: 1567

Options to use multithreading to process a group of database records?

I have a database table that contains some records to be processed. The table has a flag column with the following status values: 1 - ready to be processed, 2 - successfully processed, 3 - processing failed.

The .NET code (a repeating process, either a console app or a service) grabs a list of records that are ready to be processed, loops through them, attempts to process each one (processing is not very lengthy), and updates the status based on success or failure.

To improve performance, I want to make this process multithreaded. I'm thinking of spawning, say, 6 threads, with each thread grabbing a subset of the records.

Obviously I want to avoid having different threads process the same records. I don't want a "Being processed" flag in the database, because a thread that crashes would leave its record hanging in that state.

The only way I can see to do this is to grab the complete list of available records and assign a group (maybe by ids) to each thread. If an individual thread fails, its unprocessed records will be picked up the next time the process runs.

Are there any alternatives to dividing the records into groups before assigning them to threads?

Upvotes: 7

Answers (3)

user166390

Here is one approach that does not rely on an additional database column (but see #4) or mandate an in-process queue. The premise of this approach is to "shard" records across workers based on some consistent value, much like a distributed cache.

Here are my assumptions:

1. Re-processing does not cause unwanted side-effects; at most some work "is wasted".
2. The number of threads is fixed upon start-up. This is not a requirement, but it does simplify the implementation and allows me to skip transitory details in the simple description below.
3. There is only one "worker process" (but see #1) controlling the "worker threads". This simplifies dealing with how the records are split between workers.
4. There is some [immutable] "ID" column which is "well distributed". This is required so each worker gets about the same amount of work.
5. Work can be done "out of order" as long as it is "eventually done". Also, workers might not always run "at 100%" due to each one effectively working on a different queue.

Assign each thread a unique bucket value from [0, thread_count). If a thread dies or is restarted, its replacement takes the bucket it vacated.

Then, each time a thread needs a new record, it fetches from the database:

SELECT *
FROM record
WHERE state = 'unprocessed'
AND (id % $thread_count) = $bucket
ORDER BY date

Other designs are of course possible, such as reading "this thread's tasks" in a batch and storing them locally. A local queue, however, would be per thread (and thus reloaded whenever a new thread starts), so it would only hold records associated with the given bucket.

When the thread finishes processing a record, it should mark the record as processed using the appropriate isolation level and/or optimistic concurrency, then proceed to the next record.
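The bucket assignment described above can be sketched in C# roughly as follows. This is a minimal in-memory illustration, not the answer's actual implementation: `BucketSharding`, `BucketFor`, `FetchBucket`, and `ProcessAll` are hypothetical names, and `FetchBucket` stands in for the per-bucket SELECT by filtering a list of ids instead of querying a database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BucketSharding
{
    // Maps a record id to a bucket in [0, threadCount). The double remainder
    // keeps the result non-negative even if an id happens to be negative.
    public static int BucketFor(long id, int threadCount) =>
        (int)(((id % threadCount) + threadCount) % threadCount);

    // Hypothetical stand-in for the per-bucket SELECT: returns the
    // unprocessed ids that belong to the given bucket.
    public static List<long> FetchBucket(IEnumerable<long> unprocessedIds, int threadCount, int bucket) =>
        unprocessedIds.Where(id => BucketFor(id, threadCount) == bucket).ToList();

    // Runs one worker per bucket and returns the ids each worker handled.
    public static List<long>[] ProcessAll(IReadOnlyCollection<long> unprocessedIds, int threadCount, Action<long> process)
    {
        var handled = new List<long>[threadCount];
        var workers = Enumerable.Range(0, threadCount).Select(bucket => Task.Run(() =>
        {
            var mine = FetchBucket(unprocessedIds, threadCount, bucket);
            foreach (var id in mine)
            {
                process(id);
            }
            handled[bucket] = mine; // each worker writes only its own slot
        })).ToArray();
        Task.WaitAll(workers);
        return handled;
    }
}
```

Because every id maps to exactly one bucket, no two workers ever see the same record, and a restarted worker that reuses its bucket number simply picks up whatever is still unprocessed there.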

Upvotes: 0

Eric J.

Reputation: 150138

The most straightforward way to implement this requirement is to use the Task Parallel Library's Parallel.ForEach (or Parallel.For) and allow it to manage the individual worker threads.
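As a rough sketch of what that could look like, the example below caps parallelism at 6 (the thread count mentioned in the question) and records a status per record. `ParallelProcessing`, `ProcessBatch`, and the `process` delegate are hypothetical; in a real service the outcome would be written back to the database rather than collected in a dictionary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelProcessing
{
    // Status values taken from the question's flag column.
    public const int Processed = 2;
    public const int Failed = 3;

    // Processes each ready id on a TPL-managed worker and returns id -> status.
    public static IDictionary<int, int> ProcessBatch(IEnumerable<int> readyIds, Action<int> process, int maxThreads = 6)
    {
        var outcomes = new ConcurrentDictionary<int, int>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        Parallel.ForEach(readyIds, options, id =>
        {
            try
            {
                process(id);
                outcomes[id] = Processed;
            }
            catch (Exception)
            {
                // A failure in one record should not stop the others.
                outcomes[id] = Failed;
            }
        });

        return outcomes;
    }
}
```

Catching inside the loop body is a deliberate choice here: an exception that escapes the body would otherwise stop the loop from starting new iterations and surface as an AggregateException.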

From experience, I would recommend the following:

Have an additional status "Processing"
Have a column in the database that indicates when a record was picked up for processing, and a cleanup task/process that runs periodically looking for records that have been "Processing" for far too long (resetting their status to "ready for processing").
Even though you don't want it, "being processed" will be essential to crash recovery scenarios (unless you can tolerate the same record being processed twice).
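The cleanup rule in the list above might be sketched like this. Everything here is an assumption for illustration: the `WorkRecord` type, the value 4 for "Processing", and the in-memory collection, which stands in for what would normally be a single UPDATE statement against the table.

```csharp
using System;
using System.Collections.Generic;

// Values 1-3 come from the question; 4 is an assumed value for the extra status.
public enum RecordStatus { Ready = 1, Processed = 2, Failed = 3, Processing = 4 }

public sealed class WorkRecord
{
    public int Id { get; set; }
    public RecordStatus Status { get; set; }
    public DateTime? PickedUpAtUtc { get; set; } // set when a worker claims the record
}

public static class StaleRecordCleanup
{
    // Resets records that have been "Processing" for longer than the timeout
    // back to "Ready" and returns how many were reset.
    public static int ResetStale(IEnumerable<WorkRecord> records, DateTime nowUtc, TimeSpan timeout)
    {
        int reset = 0;
        foreach (var record in records)
        {
            bool stale = record.Status == RecordStatus.Processing
                && record.PickedUpAtUtc.HasValue
                && nowUtc - record.PickedUpAtUtc.Value > timeout;
            if (stale)
            {
                record.Status = RecordStatus.Ready;
                record.PickedUpAtUtc = null;
                reset++;
            }
        }
        return reset;
    }
}
```

The timeout should comfortably exceed the longest expected processing time; otherwise a slow but healthy worker could have its record handed to someone else.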

Alternatively

Consider using a transactional queue (MSMQ or RabbitMQ come to mind). They are optimized for this very problem.

That would be my clear choice, having done both at massive scale.

Optimizing

If it takes a non-trivial amount of time to retrieve data from the database, you can consider a Producer/Consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern allows one thread (producer) to populate a queue with DB records to be processed, and multiple other threads (consumers) to process items off of that queue.
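A minimal sketch of that pattern, assuming the producer simply enumerates ids that were already fetched and that `process` does not throw (a real implementation would need error handling and cancellation). `ProducerConsumer.Run` and the bounded capacity of 100 are illustrative choices, not something the answer specifies.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ProducerConsumer
{
    // One producer fills the queue; consumerCount consumers drain it.
    // Returns the number of items processed.
    public static int Run(IEnumerable<int> readyIds, Action<int> process, int consumerCount = 6)
    {
        int processed = 0;
        using (var queue = new BlockingCollection<int>(boundedCapacity: 100))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    // Add blocks when the queue is full, which throttles the producer.
                    foreach (var id in readyIds) queue.Add(id);
                }
                finally
                {
                    // Signals consumers that no more items will arrive.
                    queue.CompleteAdding();
                }
            });

            var consumers = Enumerable.Range(0, consumerCount).Select(_ => Task.Run(() =>
            {
                // Blocks while the queue is empty; ends once adding is complete.
                foreach (var id in queue.GetConsumingEnumerable())
                {
                    process(id);
                    Interlocked.Increment(ref processed);
                }
            })).ToArray();

            Task.WaitAll(consumers);
            producer.Wait();
        }
        return processed;
    }
}
```

Each item is handed to exactly one consumer, so a fast consumer naturally takes more items than a slow one without any up-front partitioning.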

A New Alternative

Given that several processing steps touch the record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.

Upvotes: 6

Andrei Neagu

Reputation: 896

I remember doing something like what you described. A thread checks from time to time whether there is something new in the database that needs to be processed. It loads only the new ids, so if the last id read at time x is 1000, the read at x+1 starts from id 1001.

Everything it reads goes into a thread-safe queue. When items are added to this queue, you notify the worker threads (maybe use auto-reset events, or spawn threads at this point). Each thread reads from this thread-safe queue one item at a time until the queue is empty.

You should not divide the work among the threads in advance (unless you know that processing takes the same amount of time for each item). If a thread finishes its work, it should take on load from the ones still running. Using this thread-safe queue ensures that.
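The polling idea can be sketched as below, assuming ids only ever increase and that a single thread does the polling. `IdPoller`, `PollOnce`, `Drain`, and the `fetchIdsAfter` delegate (a stand-in for a query such as "ids greater than the last one read") are all hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class IdPoller
{
    private long _lastId; // highest id read so far; touched only by the polling thread
    private readonly Func<long, IReadOnlyList<long>> _fetchIdsAfter;

    public ConcurrentQueue<long> Queue { get; } = new ConcurrentQueue<long>();

    public IdPoller(Func<long, IReadOnlyList<long>> fetchIdsAfter, long startAfter = 0)
    {
        _fetchIdsAfter = fetchIdsAfter;
        _lastId = startAfter;
    }

    // One polling pass: enqueue ids above the last one read and advance the mark.
    public int PollOnce()
    {
        var fresh = _fetchIdsAfter(_lastId);
        foreach (var id in fresh)
        {
            Queue.Enqueue(id);
            if (id > _lastId) _lastId = id;
        }
        return fresh.Count;
    }

    // Workers take one item at a time until the queue is empty, so a worker
    // that finishes early keeps pulling instead of idling.
    public static void Drain(ConcurrentQueue<long> queue, Action<long> process, int workerCount)
    {
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Run(() =>
            {
                while (queue.TryDequeue(out var id)) process(id);
            });
        }
        Task.WaitAll(workers);
    }
}
```

One consequence of tracking only the last id: a record whose processing fails is not seen again by the poller, so retries would need to be handled separately.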

Upvotes: 2
