Alex Dee

Reputation: 129

Multithreaded application with database read - each thread unique records

I have a .NET application which reads about a million records from a database table every 5 minutes, does some processing, and updates the table, marking the records as processed.

Currently the application runs in a single thread: it takes roughly the top 4K records from the DB table, processes them, updates them, and moves on to the next batch.

I'm using Dapper with stored procedures. I retrieve 4K records at a time to avoid DB table locks.

What would be the optimal way to retrieve records in multiple threads while ensuring that each thread gets a fresh batch of 4K records?

My current idea is to first retrieve just the ids of the 1M records, sort them in ascending order, and split them into 4K batches, remembering the lowest and highest id in each batch. Then each thread would call another stored procedure that retrieves the full records between the lowest and highest ids of its batch, process them, and so on.
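For illustration, that id-range approach might look roughly like this with Dapper. The stored procedure names (`GetUnprocessedIds`, `GetRecordsByIdRange`), the connection string, and the `Record` type are all hypothetical placeholders for whatever the real schema uses:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

const int BatchSize = 4000;

// Step 1: fetch only the ids of unprocessed records, sorted ascending.
List<long> ids;
using (var conn = new SqlConnection(connectionString))
{
    ids = conn.Query<long>("GetUnprocessedIds",
            commandType: CommandType.StoredProcedure)
        .OrderBy(id => id)
        .ToList();
}

// Step 2: split the sorted ids into 4K batches, keeping the low/high id of each.
// Because the ids are sorted and batches are contiguous, the ranges never overlap.
var ranges = Enumerable.Range(0, (ids.Count + BatchSize - 1) / BatchSize)
    .Select(i => ids.Skip(i * BatchSize).Take(BatchSize).ToList())
    .Select(batch => new { Low = batch.First(), High = batch.Last() })
    .ToList();

// Step 3: each worker fetches and processes the full records for its own range.
Parallel.ForEach(ranges, range =>
{
    using (var conn = new SqlConnection(connectionString))
    {
        var records = conn.Query<Record>("GetRecordsByIdRange",
            new { LowId = range.Low, HighId = range.High },
            commandType: CommandType.StoredProcedure);
        // ...process and mark the records as processed...
    }
});
```

Each `Parallel.ForEach` worker opens its own connection, since `SqlConnection` is not thread-safe.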

Is there a better pattern I'm not aware of?

Upvotes: 2

Views: 3459

Answers (1)

Mavi Domates

Reputation: 4521

I find this problem interesting, partly because I'm attempting to do something similar in principle, but also because I haven't yet seen a super intuitive industry-standard solution to it.

What you are proposing would work if you write your SQL query correctly. It should be achievable using ROW_NUMBER / BETWEEN. I'll document some other alternatives here along with their benefits and caveats.
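A minimal sketch of the ROW_NUMBER / BETWEEN idea, with assumed table and column names (`dbo.Records`, `Id`, `Processed`):

```csharp
using System.Data.SqlClient;
using Dapper;

// Each thread asks for its own window of row numbers.
const string sql = @"
    WITH numbered AS (
        SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS rn
        FROM dbo.Records
        WHERE Processed = 0
    )
    SELECT * FROM numbered WHERE rn BETWEEN @Low AND @High;";

using (var conn = new SqlConnection(connectionString))
{
    // First 4K-row window; the next thread would ask for 4001-8000, etc.
    var batch = conn.Query<Record>(sql, new { Low = 1, High = 4000 });
}
```

One caveat: the row numbers shift as rows get marked processed, so the windows are only stable if all threads compute them against the same snapshot. Pinning batches by actual id ranges, as you propose, avoids that problem.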

Parallel processing

I understand that you want to do this in SQL Server, but just as a reference, Oracle implements this as a keyword that lets you run queries in parallel.

Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm

SQL Server implements this differently: you have to turn it on explicitly through a more complex hint, and you have to be on a certain version.

A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/
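Following that article, the hint-based approach on SQL Server 2016 SP1+ looks roughly like this (table and column names are assumptions):

```csharp
using System.Data.SqlClient;
using Dapper;

// USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE') nudges the optimizer
// toward a parallel plan; it is not a hard guarantee of parallelism.
const string sql = @"
    SELECT *
    FROM dbo.Records
    WHERE Processed = 0
    OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE'));";

using (var conn = new SqlConnection(connectionString))
{
    var records = conn.Query<Record>(sql);
}
```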

You can combine parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL, with SQL Server managing the data chunks instead of your own threads.

SQL CLR integration

One nice feature you might look into is executing .NET code in SQL Server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration

This would basically allow you to run C# code inside your SQL Server, saving you the read / process / write roundtrip. The integration tooling around this has improved as well - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
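A CLR stored procedure is just a static method in an assembly you deploy into SQL Server with CREATE ASSEMBLY / CREATE PROCEDURE. A minimal sketch (the class, method, and parameter names are illustrative):

```csharp
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public static class BatchProcessor
{
    [SqlProcedure]
    public static void ProcessBatch(int lowId, int highId)
    {
        // "context connection=true" reuses the caller's own connection,
        // so the data never leaves the server - no network roundtrip.
        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            // ...read, process, and mark records between lowId and highId...
            SqlContext.Pipe.Send($"Processed ids {lowId}-{highId}");
        }
    }
}
```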

Unfortunately, reviewing the QoS and getting at the logs when something goes wrong is not as easy as it would be if you handled this in a worker job.

Use a single thread (if you're reading from an external source)

Parallelism only helps you if certain conditions are met. The following is from Oracle's documentation, but it applies to MSSQL as well: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024

Parallel execution improves processing for:

  • Queries requiring large table scans, joins, or partitioned index scans
  • Creation of large indexes
  • Creation of large tables (including materialized views)
  • Bulk inserts, updates, merges, and deletes

There are also setup / environment requirements:

Parallel execution benefits systems with all of the following characteristics:

  • Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
  • Sufficient I/O bandwidth
  • Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
  • Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers

There are other constraints. When you use multiple threads for the operation you propose, and one of those threads gets killed, fails, or throws an exception, you will absolutely need to handle that - in a way that records the last index you've processed - so you can retry the remaining records. With a single thread that becomes far simpler.

Conclusion

Assuming that the DB is modeled correctly and couldn't be optimized further, I'd say the simplest solution - a single thread - is the best one. It's easier to log and track errors and easier to implement retry logic, and I'd say those far outweigh the benefits you would see from parallel processing. You might look into parallelism for the batch updates you'll make to the DB, but unless you're going to have a CLR DLL in SQL whose methods you invoke in a parallel fashion, I don't see the benefits winning out. Your system would also have to behave a certain way while the parallel query runs for it to be more efficient.

You can of course design your worker role to be async and not block on processing each record. You'd still be multi-threaded, but the querying would happen in a single thread.
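That single-query-thread, multi-consumer shape can be sketched with a bounded channel; `ReadNextBatch` and `Process` here stand in for your actual Dapper query and processing logic:

```csharp
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateBounded<Record>(10_000);

// Single producer: the only place that reads from the DB.
var producer = Task.Run(async () =>
{
    foreach (var record in ReadNextBatch())      // hypothetical Dapper call
        await channel.Writer.WriteAsync(record);
    channel.Writer.Complete();
});

// Several consumers process records without blocking the query thread.
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(async () =>
{
    await foreach (var record in channel.Reader.ReadAllAsync())
        Process(record);                         // hypothetical
}));

await Task.WhenAll(consumers.Append(producer));
```

The bounded capacity applies back-pressure: if processing falls behind, the producer's `WriteAsync` waits instead of pulling the whole million rows into memory.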

Edit to conclusion

After talking to my colleague about this today, it's worth adding that even with the single-thread approach you'd have to be able to recover from failure. So in principle, multiple threads vs. a single thread doesn't change the requirement of recovering / failing gracefully and remembering what you processed. How you recover would change, though, since you'd have to write more complex code to track your multiple threads and their states.

Upvotes: 2
