Reputation: 621
I'm importing more than 600.000.000 rows from an old database/table that has no primary key set, this table is in a sql server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to resume the process from where it stopped for any reason, like an error or network error. As this table doesn't have a primary key, I can't check if the row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported or not? This table has duplicated row, I already tried to compute the hash of all the columns, but it's not working due to duplicated rows...
thanks!
Upvotes: 6
Views: 8914
Reputation: 96648
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
Upvotes: 5
Reputation: 11908
Try to find a somewhat unique field combination. (duplicates are allowed) and join over a hash of the rest of the fields which you store in the destination table.
Assume a table:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))
select * from t_x x
where not exists(select *
from t_y y
where x.id = y.id
and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to try to join as many fields as possible is to reduce the chance of hash collisions which are real on a dataset with 600.000.000 records.
If you really need the duplicate rows you should add a unique id column to your big table. To achieve this in a performing way you should do the following steps:
Upvotes: 0
Reputation: 40359
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everythign every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunk you have loaded. Use a Staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you've only toss the chunk you were working on when interrupted, and resume work with that chunk.
Upvotes: 1
Reputation: 95243
With duplicate rows, even row_number()
is going to get you nowhere, as that can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new column with an identity onto the existing table (alter table oldTbl add column NewId int identity(1,1)
).
You could use row_number()
, and then back out the last n
rows if they have more than the count in the new database for them, but it would be more straight-forward to just use a landing table.
Upvotes: 0