Reputation: 52942
I have a database with 2,200,000 records in it. I do some work in a C# program which re-calculates the data for these 2,200,000 records, and I want to insert the results without creating any duplicates (i.e. if the program generated identical data for a record, do nothing). There are 9 fields that I check to see if they are the same, and if there is no difference (identical data) I don't do an insert.
I'm trying to think of a fast way to do this. All I can think of is a unique constraint across all 9 fields, but I feel this might be the wrong approach.
Previously I loaded the entire lot into memory and did the comparison there, but now that it has grown past 2 million records this has become impractical.
Any thoughts?
Bit more detail:
A service runs that collects some scientific data. If the data meets some criteria (not important) then it inserts a warning record into the database. There are 2 million warnings generated the first time the program runs.
The second time it runs, it collects the same data, but I only want records to be inserted if they differ from existing records. So I must compare the previous 2 million warnings with the 2 million new warnings. If a warning is the same, nothing happens. If it is different in any way, it is inserted as a new warning.
A scientist checks every warning one by one and flags them as important or not important. So they don't care about duplicate warnings which are exactly the same.
The data has 9 fields, some string some integer, and if any field changes at all, it must insert a new warning.
Questions:
Currently, the warnings are calculated one by one in a loop, then afterwards, inserted into the database in another loop.
A hash sounds like it might work. How can I create a hash of all the fields? Is this a manual operation in C#, or can I get SQL to do it automatically? (I will have to generate the same hash in code to stop duplicates, I think, so I need to be able to create it in C# too.)
Yes, if the previous version is identical to the current version then do not insert. E.g. if I have the data "1, 125, abcdef, 33.4, chocolate" and I process a record and get "1, 125, abcdef, 33.4, chocolate", do not insert; if I get "1, 125, abcdef, 33.4, melon", then insert.
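A minimal sketch of the hashing idea in C#, using the five example values from the question. The field names and types here are invented for illustration (your real rows have 9 fields); the technique is simply to join the field values with a separator and hash the UTF-8 bytes with SHA-256. SQL Server can compute a matching digest server-side with `HASHBYTES('SHA2_256', ...)` if you concatenate the columns the same way.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class WarningHash
{
    // Concatenate the fields with a separator unlikely to occur in the
    // data, then hash the UTF-8 bytes with SHA-256. Identical field
    // values always produce an identical hash; any changed field
    // produces a different one (barring astronomically rare collisions).
    public static string Compute(int id, int code, string source,
                                 double value, string label)
    {
        // "R" is the round-trip format so the double serialises consistently.
        string key = string.Join("|", id, code, source,
                                 value.ToString("R"), label);
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
```

With this, the comparison from the example becomes: `Compute(1, 125, "abcdef", 33.4, "chocolate")` equals itself on re-calculation (skip), but differs from `Compute(1, 125, "abcdef", 33.4, "melon")` (insert).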
Upvotes: 0
Views: 94
Reputation: 7005
Use hashes.
I had a similar issue while implementing "Delta Imports" for a DWH I maintain for a customer. The easiest and fastest way is to compare the hashes of both data sets. Each record has a uid (uniqueidentifier) column and a hash (nvarchar(400)) column. The hash is calculated just before inserting the record into the respective table in the database.
With each new import (i.e. each re-calculation in your scenario), the hash is calculated again and compared to the existing hash for the same uid. The record is skipped if both match and updated if they differ.
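The skip/update decision above can be sketched as follows. This is a hypothetical illustration, not my production code: `existingHashes` stands in for the result of a query like `SELECT uid, hash FROM Warnings` loaded into memory (hashes alone are far smaller than the full 9-field rows), and the names are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class DeltaImport
{
    public enum RowAction { Skip, Update, Insert }

    // Given the existing uid -> hash map and a freshly calculated hash,
    // decide what to do with the record.
    public static RowAction Decide(IDictionary<Guid, string> existingHashes,
                                   Guid uid, string newHash)
    {
        if (!existingHashes.TryGetValue(uid, out string oldHash))
            return RowAction.Insert;   // never seen before: insert
        return oldHash == newHash
            ? RowAction.Skip           // identical data: do nothing
            : RowAction.Update;        // data changed: update the row
    }
}
```

In your warning scenario, where any field change should produce a new warning rather than overwrite an old one, you would treat the `Update` case as an insert instead.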
Upvotes: 1