Reputation: 21
I have a C# console application running on a 64-bit Windows Server 2008 R2 machine which also hosts Microsoft SQL Server 2005.
This application iterates through text files, reads each line, splits the line's values into variables, and inserts the data into a SQL database hosted at localhost.
Each text file is processed on a new thread, each line is processed on a new thread, and each SQL insert statement is executed on a new thread.
I am counting the number of each of these types of threads and decrementing when they complete. I'm wondering what the best way is to "pend" future threads from opening...
For example, before a new SQL insert thread is opened I'm calling:
while (numberofcurrentthreads > specifiednumberofthreads)
{
// wait
}
new Thread(insertSQL).Start();
Here specifiednumberofthreads has been estimated to a value that does not cause System.OutOfMemoryExceptions. A lot of guesswork has gone into determining that number for each process.
My question is: is there a more 'efficient' or proper way to do this? Is there a way to read available system memory, not physical memory, and wait based on a specified resource allotment?
To illustrate this idea...
while (System.AvailableMemory < (System.TotalMemory / 2) || System.OutOfMemory == true)
{
// wait
}
new Thread(insertSQL).Start();
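For what it's worth, .NET does offer a built-in way to express roughly this check: System.Runtime.MemoryFailPoint. A minimal sketch, assuming an operation needs on the order of 128 MB (that number is a guess you would have to tune):

```csharp
try
{
    // The gate throws InsufficientMemoryException up front if the CLR
    // does not expect 128 MB to be available, instead of failing mid-insert.
    using (var gate = new System.Runtime.MemoryFailPoint(128))
    {
        insertSQL();   // run the insert while the gate is held
    }
}
catch (InsufficientMemoryException)
{
    // Not enough memory right now: back off and retry later rather than
    // letting an OutOfMemoryException happen partway through the work.
}
```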
The current method I am employing works and completes in a decent time, but it could do better. Some of the text files going through the process are larger than others and do not necessarily make the best use of system resources.
For example, processing 2 text files at a time works perfectly when both text files are under 300 KB. It does not work so well if one or both are over 100,000 KB.
There also seems to be a sweet spot where things process most efficiently, somewhere around 75% of all CPU resources. Crank these values too high and it will run at 100% CPU but process far slower, as it cannot keep up.
Upvotes: 2
Views: 1885
Reputation: 133995
It's crazy to be creating a new thread for every file and for every line and for every SQL insert statement. You'd probably be much better off using three threads and a chained producer-consumer model, all of which communicate through thread-safe queues. In C#, that would be BlockingCollection.
First, you set up two queues, one for lines that have been read from a text file, and one for lines that have been processed:
const int MaxQueueSize = 10000;
BlockingCollection<string> _lines = new BlockingCollection<string>(MaxQueueSize);
BlockingCollection<DataObject> _dataObjects = new BlockingCollection<DataObject>(MaxQueueSize);
DataObject, by the way, is what I'm calling the object that you'll be inserting into the database. You don't say what that is. It doesn't really matter for the purposes of this discussion, but you'd replace it with whatever type you use to represent the processed string.
Now, you create three threads:
1. The first thread reads the text files line by line and places the lines on the _lines queue.
2. The second thread takes a line from the _lines queue, processes it, and creates a DataObject, which it then places on the _dataObjects queue.
3. The third thread reads the _dataObjects queue and inserts the objects into the database.
Beyond simplicity (and this is very easy to put together), there are many benefits to this model.
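A minimal sketch of that three-thread pipeline, assuming .NET 4 (which BlockingCollection requires); Parse and InsertIntoDatabase are hypothetical placeholders for your own line-splitting and SQL logic, and DataObject is a stub:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;

class DataObject { /* whatever type represents a processed line */ }

class Pipeline
{
    const int MaxQueueSize = 10000;
    static BlockingCollection<string> _lines = new BlockingCollection<string>(MaxQueueSize);
    static BlockingCollection<DataObject> _dataObjects = new BlockingCollection<DataObject>(MaxQueueSize);

    static void ReadFiles(IEnumerable<string> paths)
    {
        foreach (var path in paths)
            foreach (var line in File.ReadLines(path))
                _lines.Add(line);          // blocks when the queue is full
        _lines.CompleteAdding();           // no more lines are coming
    }

    static void ProcessLines()
    {
        foreach (var line in _lines.GetConsumingEnumerable())
            _dataObjects.Add(Parse(line)); // Parse = your line-splitting logic
        _dataObjects.CompleteAdding();
    }

    static void InsertRecords()
    {
        foreach (var obj in _dataObjects.GetConsumingEnumerable())
            InsertIntoDatabase(obj);       // your SQL insert
    }

    static DataObject Parse(string line) { return new DataObject(); }
    static void InsertIntoDatabase(DataObject obj) { /* your INSERT here */ }

    public static void Run(IEnumerable<string> paths)
    {
        var reader = new Thread(() => ReadFiles(paths));
        var processor = new Thread(ProcessLines);
        var inserter = new Thread(InsertRecords);
        reader.Start(); processor.Start(); inserter.Start();
        reader.Join(); processor.Join(); inserter.Join();
    }
}
```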
First, having more than one thread reading from the disk concurrently usually leads to slower performance because the disk drive can only do one thing at a time. Having multiple threads hitting the disk at the same time just causes unnecessary head seeks. Just one thread will keep your input queue full.
Second, limiting the queues' sizes will prevent you from running out of memory. When the disk-reading thread tries to insert the 10,001st item into the queue, it will wait until the processing thread removes an item. That's the "blocking" part of BlockingCollection.
You might find that you can speed up your SQL inserts by grouping them and sending a bunch of records at once, doing what is essentially a bulk insert of 100 or 1,000 records at a time rather than sending 100 or 1,000 individual transactions.
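A rough sketch of that batching idea for the database thread; InsertBatch is a hypothetical helper that would wrap SqlBulkCopy or a single multi-row INSERT transaction, and the batch size of 1,000 is an assumption to tune:

```csharp
static void InsertRecords()
{
    const int BatchSize = 1000;
    var batch = new List<DataObject>(BatchSize);

    foreach (var obj in _dataObjects.GetConsumingEnumerable())
    {
        batch.Add(obj);
        if (batch.Count == BatchSize)
        {
            InsertBatch(batch);   // one transaction for the whole batch
            batch.Clear();
        }
    }

    if (batch.Count > 0)
        InsertBatch(batch);       // flush the final partial batch
}
```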
This solution prevents the problem of too many threads. You have a fixed number of threads, all of which are running as fast as they possibly can. And memory use is constrained by limiting the number of things that can be in the queues.
The solution also scales rather well. If you have files on multiple drives, you can add a second file-reading thread to read the files from the other physical drive and place the lines in the same queue. BlockingCollection supports multiple producers and multiple consumers, so adding another producer is no trouble at all.
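For instance, a second reader might be added like this (ReadFrom is a hypothetical method that reads one directory's files into the shared queue, and the paths are assumptions). One subtlety: with multiple producers, CompleteAdding must be called only after every producer has finished:

```csharp
var readerA = new Thread(() => ReadFrom(@"C:\input", _lines));   // first drive
var readerB = new Thread(() => ReadFrom(@"D:\input", _lines));   // second physical drive
readerA.Start();
readerB.Start();

// A small coordinator thread waits for both readers before closing the queue,
// so the processing thread's GetConsumingEnumerable can finish cleanly.
new Thread(() =>
{
    readerA.Join();
    readerB.Join();
    _lines.CompleteAdding();
}).Start();
```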
The same goes for consumers. If you find that the processing step is the bottleneck, you can add another processing thread. It, too, will read from the _lines queue and write to the _dataObjects queue.
However, having more threads than you have processor cores will likely make your program slower. If you have a four-core processor, creating 8 processing threads won't do you any good. It will make things slower because the operating system will be spending a lot of time on thread context switches rather than on doing useful work.
You'll have to do a little tuning to get the best performance. Queue sizes should be large enough to support continuous workflow (so no thread is starved of work, or spends too much time waiting for the output queue), but not so large to overfill memory. Depending on the relative speed of the three stages, one of the queues might have to be larger than the other. If one of the three stages is a bottleneck, you can add another thread to help at that stage.
I created a simple example of this model using text files for input and output. It should be pretty easy to extend for your situation. See Simple Multithreading, and the follow up, Part 2.
Upvotes: 5