Reputation: 51
I have about 60 million records in a database and have to process all of them. The idea is to use C# code to read the data, process it, and then put it back into the db. The data doesn't come from and go to the same table - multiple tables are involved.
I want to know the best way to go about doing that. Should I read 100K records at a time into a DataSet, process each record, bulk insert the results back to the database, and then read the next set?
Upvotes: 5
Views: 11188
Reputation: 5212
Don't go anywhere near a DataSet or a DataAdapter!
For getting the data down, use a DataReader - run SQL text or a stored proc invocation through an SqlCommand by calling ExecuteReader on it. Then you can retrieve the records from the DataReader one at a time, WITHOUT any of the object-tracking baggage that comes with a DataSet, Entity Framework, Linq to SQL, or NHibernate - all of these frameworks have layers added to allow you to do object and change tracking, which you don't need and will only be overhead for you.
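A minimal sketch of that reader side (the connection string, query, and column layout are placeholders, not from the question):

    using System.Data.SqlClient;

    using (var connection = new SqlConnection(sourceConnectionString))
    using (var command = new SqlCommand("SELECT Id, Payload FROM dbo.SourceTable", connection))
    {
        connection.Open();
        using (SqlDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // Pull only the columns you need; no DataSet/DataTable is materialised,
                // so memory stays flat no matter how many rows stream through.
                long id = reader.GetInt64(0);
                string payload = reader.GetString(1);
                ProcessRecord(id, payload);   // placeholder for your processing step
            }
        }
    }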
When you write your results back to the database, do so through SqlBulkCopy, with TableLock enabled and with the database's "Recovery Model" set to anything but "Full". Make sure constraints on the target table are disabled and that no indexes are defined (drop them and recreate them at the end if you need to).
SqlBulkCopy does its own batching when sending back to SQL Server, IF you make sure you specify a BatchSize (the default is everything in one batch). You will probably also want to set UseInternalTransaction on the SqlBulkCopy so that each batch is done in its own transaction - this will reduce transaction journal usage even further.
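A rough sketch of the writer side with those options set (the table name, column mappings, and batch size are illustrative):

    using System.Data;
    using System.Data.SqlClient;

    using (var bulkCopy = new SqlBulkCopy(
        targetConnectionString,
        SqlBulkCopyOptions.TableLock | SqlBulkCopyOptions.UseInternalTransaction))
    {
        bulkCopy.DestinationTableName = "dbo.TargetTable";
        bulkCopy.BatchSize = 10000;        // without this, everything goes over in one batch
        bulkCopy.BulkCopyTimeout = 0;      // don't time out on a long-running load
        bulkCopy.ColumnMappings.Add("Id", "Id");
        bulkCopy.ColumnMappings.Add("Result", "Result");

        // 'resultsTable' is a DataTable (an IDataReader works too) holding
        // the processed rows ready to be written back.
        bulkCopy.WriteToServer(resultsTable);
    }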
A reader thread and a writer thread may or may not help; I've not profiled the difference. You may also need one or more processing threads, or some other mechanism, if the "third party process" you mention is time-consuming.
It is possible to do this all on a single thread, one record at a time, and it can be quite fast (depending on the cost of the processing you're doing).
If you DO need to use multiple threads, DON'T swap single records between them, as you'll lose too many CPU cycles to thread switching: chunk the work into "reasonable" batches. "Reasonable" could be anywhere between 1k and 100k records, depending on the record sizes and the processing you're doing. It may even be worth giving the Task Parallel Library (Parallel.ForEach) a go at doing it for you.
Given that you say multiple tables are involved, just spinning up one thread per source table might work well, with a lock around writing to the SqlBulkCopy object for synchronisation.
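As a rough sketch of the chunked approach with Parallel.ForEach and a lock around the shared writer (Record, ProcessRecord, and WriteWithBulkCopy are placeholder names; the chunk size is just a starting point to tune):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    // 'allRecords' is a List<Record> already buffered from the reader.
    const int chunkSize = 10000;
    var chunks = new List<List<Record>>();
    for (int i = 0; i < allRecords.Count; i += chunkSize)
        chunks.Add(allRecords.GetRange(i, Math.Min(chunkSize, allRecords.Count - i)));

    var bulkCopyLock = new object();

    Parallel.ForEach(chunks, chunk =>
    {
        // CPU-bound work: each task gets a whole chunk, not single records.
        var results = chunk.Select(ProcessRecord).ToList();

        lock (bulkCopyLock)   // serialise writes to the shared SqlBulkCopy
        {
            WriteWithBulkCopy(results);
        }
    });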
Upvotes: 4
Reputation: 52675
Typically the absolute fastest way is to do everything on the server in SQL batches.
If you insist on using a client, then separate threads to read and write can be faster than using one thread to do both. How many threads to read and write with will depend on the hardware and on what you're doing.
EDIT: Clarifying the approach.
Retrieving the data and sending it to the SQL Server are both network-I/O bound and out of process. This means that on both reading and sending the data, your application spends time waiting for data to go from disk, over the network, and into memory. Let's assume it would take 1 hour to retrieve the data, 10 minutes to process it, and 1 hour to send the data back to the db. So your entire process would take 2 hours and 10 minutes.
If you split it up into three threads - 1 reader, 1 processor, 1 updater - you can get it down to close to 1 hour. If you write your application well, you can add additional threads for reading, processing, and writing, but you might be disappointed by the results because of things like sharing of cache lines, how the network card responds to lots of concurrent requests, etc.
Also, when you use a DataAdapter to fill a DataSet, you can't touch any of the data until the fill is complete. If you use a DataReader, on the other hand, you can start using the data as soon as the first row is complete. This means you don't have to worry about limiting yourself to 100K at a time.
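A rough sketch of that reader/processor/updater split using BlockingCollection (Record, ReadFromDatabase, Process, and WriteWithBulkCopy are placeholder names; the bounded capacity is just a guess to keep memory in check):

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    var rawRecords = new BlockingCollection<Record>(boundedCapacity: 50000);
    var processedRecords = new BlockingCollection<Record>(boundedCapacity: 50000);

    // Reader: streams rows from the source (e.g. wrapping a SqlDataReader).
    var reader = Task.Run(() =>
    {
        foreach (var record in ReadFromDatabase())
            rawRecords.Add(record);
        rawRecords.CompleteAdding();
    });

    // Processor: CPU-bound work, overlapped with both I/O stages.
    var processor = Task.Run(() =>
    {
        foreach (var record in rawRecords.GetConsumingEnumerable())
            processedRecords.Add(Process(record));
        processedRecords.CompleteAdding();
    });

    // Updater: streams the processed rows back, e.g. via SqlBulkCopy.
    var updater = Task.Run(() =>
    {
        WriteWithBulkCopy(processedRecords.GetConsumingEnumerable());
    });

    Task.WaitAll(reader, processor, updater);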
Upvotes: 1