Angshuman Agarwal

Reputation: 4866

Read 5 million records and finally update a column

I have to update 5 million+ records in a database table T1. A C# tool will READ (SELECT) a column in table T1, say T1.col1, extract a value from it based on some logic, and finally UPDATE another column, T1.col2, in the same table with the processed value.

I wanted some opinions on the best/most optimised way to achieve this in C#/ADO.NET.

NOTE: The extraction logic cannot be part of the SQL. That logic is embedded in a COM DLL which I am calling via interop from .NET and applying to column Col1's value to generate a new value which finally has to be saved in T1.Col2.

Upvotes: 3

Views: 2388

Answers (7)

Yahia

Reputation: 70379

Since you need to transfer the data for some operation by a COM object, this is what I would do:

Use a machine with lots of memory - Load the data in chunks (for example 5000 or 50000 rows at a time) into memory, process it and do the update on the SQL Server...

For the UPDATE part use transactions and put 5000 - 20000 UPDATEs into one transaction...

[EDIT]: by partitioning the work properly and assigning 500,000 or 1,000,000 rows to one "worker machine" you can speed this up to the max limit of your SQL Server... [/EDIT]
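The chunk-and-commit pattern might look roughly like this in ADO.NET (a sketch only: the key column `id`, the batch size, and `ComTransform` as a stand-in wrapper for the interop call are all assumptions):

```csharp
using System.Data;
using System.Data.SqlClient;

const int BatchSize = 5000;

using (var readConn = new SqlConnection(connStr))
using (var writeConn = new SqlConnection(connStr))
{
    readConn.Open();
    writeConn.Open();

    var readCmd = new SqlCommand("SELECT id, col1 FROM T1", readConn);
    using (var reader = readCmd.ExecuteReader())
    {
        var tx = writeConn.BeginTransaction();
        var updCmd = new SqlCommand(
            "UPDATE T1 SET col2 = @val WHERE id = @id", writeConn, tx);
        updCmd.Parameters.Add("@val", SqlDbType.NVarChar, 4000);
        updCmd.Parameters.Add("@id", SqlDbType.Int);

        int count = 0;
        while (reader.Read())
        {
            // ComTransform wraps the COM interop call (assumption)
            updCmd.Parameters["@val"].Value = ComTransform(reader.GetString(1));
            updCmd.Parameters["@id"].Value = reader.GetInt32(0);
            updCmd.ExecuteNonQuery();

            if (++count % BatchSize == 0)   // commit every BatchSize rows
            {
                tx.Commit();
                tx = writeConn.BeginTransaction();
                updCmd.Transaction = tx;
            }
        }
        tx.Commit();
    }
}
```

Two connections are used so the reader can keep streaming while the updates run on their own transaction.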

Another option - though not recommended (only because of theoretically possible security and/or stability issues introduced by the COM object in this specific case):

Though this is a description regarding SQL Server, something similar is possible with Oracle on Windows too:

You can put the logic of this transformation into your SQL Server by writing+installing a .NET assembly which exposes a Stored Procedure you can call to do the transformation... the .NET assembly in turn accesses that COM object... for a howto see http://www.sqlteam.com/article/writing-clr-stored-procedures-in-charp-introduction-to-charp-part-1

The MSDN reference link to this is http://msdn.microsoft.com/en-us/library/ms131094.aspx

Upvotes: 1

Jon Raynor

Reputation: 3892

If this is a one-time deal, then dump your 5 million+ records to a file. Run the logic against the file to generate the new records. The dump and file logic should be fast and not that time consuming. Then BULK insert the updated data into a staging table.

At that point, deprecate the previous table and make the staging table the real table with some DDL statements to put in the appropriate indexes, FKs, etc.

That would be the fastest way to handle this number of records. Anything else and it's probably going to take at least a few days to process everything.
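The BULK insert step could be sketched with `SqlBulkCopy` like this (the staging table name and columns are assumptions; the `DataTable` stands in for whatever the file-processing step produced):

```csharp
using System.Data;
using System.Data.SqlClient;

var table = new DataTable();
table.Columns.Add("id", typeof(int));
table.Columns.Add("col1", typeof(string));
table.Columns.Add("col2", typeof(string));
// ... fill 'table' from the processed file ...

using (var conn = new SqlConnection(connStr))
{
    conn.Open();
    using (var bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "T1_staging";  // assumed staging table
        bulk.BatchSize = 50000;                    // rows per round-trip
        bulk.BulkCopyTimeout = 0;                  // no timeout for a big load
        bulk.WriteToServer(table);
    }
}
```

For very large files you would stream rows in via an `IDataReader` overload of `WriteToServer` rather than materialising everything in one `DataTable`.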

Upvotes: 0

Chris Shain

Reputation: 51359

Some basic pointers:

  1. Use a DataReader, not a DataSet. The memory overhead of a DataSet will probably cause trouble with this many rows.
  2. If possible, run the computation portion in several threads in parallel. You can use the TPL for that, but because you are using a COM component there may be concerns about accessing it from multiple threads. Consult a COM expert (or open another SO question) on how to determine whether your COM component is thread safe.
  3. Do not keep a single massive transaction open while computing the result. Use the "with(nolock)" hint if appropriate for your semantics. This will help to prevent your tasks from impacting other readers/writers.
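Point 2 might be sketched like this, assuming a hypothetical `ComWrapper` around the interop object and creating one instance per thread in case the component is not thread safe:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded queue so the reader can't race ahead of the workers indefinitely.
var work = new BlockingCollection<(int Id, string Col1)>(boundedCapacity: 10000);
var results = new ConcurrentQueue<(int Id, string Col2)>();

var workers = Task.Run(() =>
    Parallel.ForEach(
        work.GetConsumingEnumerable(),
        () => new ComWrapper(),                 // one COM instance per thread (assumed wrapper)
        (row, _, com) =>
        {
            results.Enqueue((row.Id, com.Transform(row.Col1)));
            return com;
        },
        com => com.Dispose()));                 // per-thread cleanup

// Producer: read rows with a SqlDataReader and work.Add(...) each one,
// then call work.CompleteAdding().
// Consumer: drain 'results' into batched UPDATE transactions.
```

If the COM component turns out to be apartment-threaded, this fan-out may not help at all; that is exactly the question to settle first.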

Upvotes: 0

Michael Sagalovich

Reputation: 2549

When performing these operations, do you want the user to wait until they end? Or can you run this task in the background? Maybe even nightly, during some scheduled processing? If the latter is true, you can live with your approach and simply perform all your millions of updates; just make sure you do not make the user wait for the operation to complete. Otherwise, you really need to think of a way to port all the update logic into the database.

So depending on the requirements, the options would be:

1) Start a separate thread (or even several threads) to perform the updates, while the main thread returns to the user something like "A lengthy operation is running. Please come back to this page later to see its status"

2) Run the updates nightly in a separate process

3) Reorganize the project so that you can afford updates to be performed in the database.

UPDATE

I see you constantly saying that you cannot port your update logic into the database. If you are not responsible for the project architecture, could you influence management to reorganize it? It looks like poor architecture if such operations are needed.

Upvotes: 0

Charles Lambert

Reputation: 5132

That's a lot of data to have in memory all at once. I would recommend, if possible, pulling the data in smaller batches of records and processing them through the COM DLL. Using PLINQ to Objects would allow you to maximize processor usage. Between the two you should be able to find a happy medium that works well.
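A minimal PLINQ sketch, assuming `batch` is one chunk already pulled from the database and `ComTransform` is a wrapper around the interop call that is safe to invoke concurrently (verify that before parallelising):

```csharp
using System;
using System.Linq;

var updated = batch
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(row => new { row.Id, Col2 = ComTransform(row.Col1) })
    .ToList();   // then write 'updated' back in a batched transaction
```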

Upvotes: 0

gjvdkamp

Reputation: 10516

Do you really need to update col2 with the new value?

If every row gets the same value, I would write it into a separate table with 1 row and just cross join that row in when you need to return the result.

like this:

update t2 
set col2 = 1234 -- the computed value over all rows in t1

select t1.col1, 
       t2.col2 
from   t1
       cross join t2 -- t2 only has 1 row

Updates are relatively expensive, and writing 1 row is definitely much cheaper than writing 5 million.

Otherwise I would put the mill where the wood is, i.e. use T-SQL if possible. Then again, 5 million is not that big of a problem; can you process them on the server, or do you need to drag them across the network? In the latter case it does add up.

Rgds Gert-Jan

Upvotes: 0

BradC

Reputation: 39986

By far the fastest way would be to perform the update in the SQL code itself:

UPDATE T1
SET col2 = [some function based on col1]

(Be aware that depending on the database platform, this could explode the transaction log. For MS SQL, specifically, I'd recommend you update in smaller batches, maybe 100k rows or less at a time)
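Such batched updates might be sketched like this in T-SQL, assuming there is some way to tell not-yet-updated rows apart (here, `col2 IS NULL` is the assumption):

```sql
-- Repeat until no rows are left to update; each batch is its own
-- implicit transaction, keeping the log from growing unbounded.
WHILE 1 = 1
BEGIN
    UPDATE TOP (100000) T1
    SET col2 = [some function based on col1]
    WHERE col2 IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END
```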

If the function logic is too convoluted for that, then make sure you are issuing all your 5 million updates based on the primary key:

UPDATE T1
SET col2 = @newval
WHERE tableID = @id

Upvotes: 0
