Reputation: 4866
I have to update 5 million+ records in a database table T1. This is a C# tool which will READ (SELECT) a column of that table, say T1.col1, extract a value from it based on some logic, and then UPDATE another column, T1.col2, in the same table with the processed value and save it back to the DB.
I wanted some opinions on the best/most optimised way to achieve this in C# / ADO.NET.
NOTE: The extraction logic cannot be part of the SQL. That logic is embedded in a COM DLL which I am interoperating with from .NET and applying to the value of col1 to generate a new value which finally has to be saved in T1.col2.
Upvotes: 3
Views: 2388
Reputation: 70379
Since you need to transfer the data to a COM object for the transformation, this is what I would do:
Use a machine with lots of memory: load the data in chunks (for example 5,000 or 50,000 rows at a time) into memory, process it, and write the updates back to SQL Server...
For the UPDATE part use transactions and put 5000 - 20000 UPDATEs into one transaction...
[EDIT]: by partitioning the work properly and assigning, say, 500,000 or 1,000,000 rows to each "worker machine", you can speed this up to the limit of what your SQL Server can handle... [/EDIT]
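Something along these lines (a rough sketch only; it assumes T1 has a bigint primary key named id, and TransformViaCom() is a placeholder for your wrapper around the COM interop call - adjust names, types and the connection string to your schema):

using System.Collections.Generic;
using System.Data.SqlClient;

static void ProcessInChunks(string connStr, int chunkSize = 5000)
{
    long lastId = 0;
    while (true)
    {
        var rows = new List<(long Id, string Col1)>();
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            // Read the next chunk, keyed off the primary key so no row is read twice.
            using (var cmd = new SqlCommand(
                "SELECT TOP (@n) id, col1 FROM T1 WHERE id > @lastId ORDER BY id", conn))
            {
                cmd.Parameters.AddWithValue("@n", chunkSize);
                cmd.Parameters.AddWithValue("@lastId", lastId);
                using (var rdr = cmd.ExecuteReader())
                    while (rdr.Read())
                        rows.Add((rdr.GetInt64(0), rdr.GetString(1)));
            }

            if (rows.Count == 0) break;   // nothing left to process

            // One transaction per chunk keeps the transaction log manageable.
            using (var tx = conn.BeginTransaction())
            {
                using (var upd = new SqlCommand(
                    "UPDATE T1 SET col2 = @val WHERE id = @id", conn, tx))
                {
                    upd.Parameters.Add("@val", System.Data.SqlDbType.NVarChar, -1);
                    upd.Parameters.Add("@id", System.Data.SqlDbType.BigInt);
                    foreach (var row in rows)
                    {
                        upd.Parameters["@val"].Value = TransformViaCom(row.Col1); // COM call
                        upd.Parameters["@id"].Value = row.Id;
                        upd.ExecuteNonQuery();
                    }
                }
                tx.Commit();
            }
            lastId = rows[rows.Count - 1].Id;
        }
    }
}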
Another option - though not recommended (only because of theoretically possible security and/or stability issues introduced by the COM object in this specific case):
Though this is a description for SQL Server, something similar is possible with Oracle on Windows too.
You can put the logic of this transformation into your SQL Server by writing and installing a .NET assembly which exposes a Stored Procedure you can call to do the transformation... the .NET assembly in turn accesses that COM object... for a how-to see http://www.sqlteam.com/article/writing-clr-stored-procedures-in-charp-introduction-to-charp-part-1
The MSDN reference link to this is http://msdn.microsoft.com/en-us/library/ms131094.aspx
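A very rough skeleton of that route (names are illustrative only; here it is shown as a CLR scalar function rather than a procedure so it can be used directly inside an UPDATE, and note that calling COM from SQLCLR requires the assembly to be registered with PERMISSION_SET = UNSAFE, which is the stability/security trade-off mentioned above):

using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public static class Transformer
{
    [SqlFunction]
    public static SqlString TransformCol1(SqlString col1)
    {
        if (col1.IsNull) return SqlString.Null;
        return new SqlString(CallComTransform(col1.Value));
    }

    // Placeholder for your wrapper around the COM interop call.
    private static string CallComTransform(string input)
    {
        return input;
    }
}

Once installed, the whole job could then run server-side as a single statement along the lines of UPDATE T1 SET col2 = dbo.TransformCol1(col1).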
Upvotes: 1
Reputation: 3892
If this is a one-time deal, then dump your 5 million+ records to a file. Run the logic against the file to generate the new records. The dump and file logic should be fast and not that time consuming. Then BULK insert the updated data into a staging table.
At that point, deprecate the previous table and make the staging table the real table with some DDL statements to put in the appropriate indexes, FKs, etc.
That would be the fastest way to handle this amount of records. Anything else and it's probably going to take at least a few days to process everything.
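The bulk-load step could look roughly like this (a minimal sketch; T1_Staging and the use of a DataTable for the transformed rows are assumptions, adjust to your own staging schema):

using System.Data;
using System.Data.SqlClient;

static void BulkLoad(DataTable processedRows, string connStr)
{
    using (var bulk = new SqlBulkCopy(connStr))
    {
        bulk.DestinationTableName = "dbo.T1_Staging";
        bulk.BatchSize = 50000;              // commit in chunks rather than all at once
        bulk.WriteToServer(processedRows);   // processedRows holds the key + recomputed col2
    }
}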
Upvotes: 0
Reputation: 51359
Some basic pointers:
Upvotes: 0
Reputation: 2549
When performing these operations, do you want the user to wait until they finish? Or can you run this task in the background, maybe even nightly during some scheduled processing? If the latter is true, you can live with your approach and simply perform all your millions of updates; just make sure you do not make the user wait for the operation to complete. Otherwise, you really need to think of a way to port all the update logic into the database.
So depending on the requirements, the options would be:
1) Start a separate thread (or even several threads) to perform the updates (see the sketch after this list), while the main thread returns something to the user like "Lengthy operation is running. Please come back to this page later to see its status"
2) Run the updates nightly in a separate process
3) Reorganize the project so that you can afford updates to be performed in the database.
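For option 1, something like this would do (a minimal sketch; RunAllUpdates() is a placeholder for the read/transform/update loop described in the question):

using System.Threading.Tasks;

public static class UpdateJob
{
    public static Task Start()
    {
        // Kick the lengthy work off the request/UI thread and return right away,
        // so the user only sees the "operation is running" message.
        return Task.Run(() => RunAllUpdates());
    }

    private static void RunAllUpdates() { /* read, transform via COM, update */ }
}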
UPDATE
I see you keep saying that you cannot port your update logic into the database. If you are not responsible for the project architecture, could you influence management to reorganize it? It looks like poor architecture if such operations are needed.
Upvotes: 0
Reputation: 5132
That's a lot of data to have in memory all at once. If possible, I would recommend pulling the data in smaller batches of records and running them through the COM DLL. Using PLINQ to Objects would let you maximize processor usage. Between the two you should be able to find a happy medium that works well.
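A minimal sketch of the PLINQ part, assuming one batch of (id, col1) rows has already been read from the database and comTransform wraps the COM call (both are assumptions; also note that if the COM object is single-threaded/STA, parallel calls may need one COM instance per thread):

using System;
using System.Collections.Generic;
using System.Linq;

static List<(long Id, string NewValue)> TransformBatch(
    List<(long Id, string Col1)> batch, Func<string, string> comTransform)
{
    // Spread the CPU-bound transformation over all cores; the results can then
    // be written back to the database in a batched UPDATE.
    return batch
        .AsParallel()
        .Select(row => (row.Id, comTransform(row.Col1)))
        .ToList();
}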
Upvotes: 0
Reputation: 10516
Do you really need to update col2 with the new value?
If every row gets the same value, I would write it into a separate table with 1 row and just cross join that row in when you need to return the result.
like this:
update t2
set col2 = 1234 -- the computed value over all rows in t1

select t1.col1,
       t2.col2
from t1
cross join t2 -- t2 only has 1 row
Updates are relatively expensive, and writing 1 row is definitely much cheaper than writing 5 million.
Otherwise I would put the mill where the wood is, i.e. use T-SQL if possible. Then again, 5 million is not that big a problem; can you process them on the server, or do you need to drag them across the network? In the latter case it does add up.
Rgds Gert-Jan
Upvotes: 0
Reputation: 39986
By far the fastest way would be to perform the update in the SQL code itself:
UPDATE T1
SET col2 = [some function based on col1]
(Be aware that depending on the database platform, this could explode the transaction log. For MS SQL, specifically, I'd recommend you update in smaller batches, maybe 100k rows or less at a time)
If the function logic is too convoluted for that, then make sure you are issuing all your 5 million updates based on the primary key:
UPDATE T1
SET col2 = @newval
WHERE tableID = @id
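Issued from C#, that could look roughly like this (a sketch only; it assumes tableID is a bigint primary key and col2 is nvarchar, and reuses one parameterized command so the statement plan is reused across all 5 million executions):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void WriteBack(string connStr, IEnumerable<(long Id, string NewVal)> rows)
{
    using (var conn = new SqlConnection(connStr))
    {
        conn.Open();
        using (var cmd = new SqlCommand(
            "UPDATE T1 SET col2 = @newval WHERE tableID = @id", conn))
        {
            cmd.Parameters.Add("@newval", SqlDbType.NVarChar, -1);
            cmd.Parameters.Add("@id", SqlDbType.BigInt);
            foreach (var (id, newVal) in rows)
            {
                cmd.Parameters["@newval"].Value = newVal;
                cmd.Parameters["@id"].Value = id;
                cmd.ExecuteNonQuery();   // hits the PK index, one row per statement
            }
        }
    }
}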
Upvotes: 0