sooprise
sooprise

Reputation: 23187

Removing duplicates before inserting into database

Let's say I have a temp table of data I want to put into another table, but before doing so, I want to remove any rows in the temp table that already exist in the destination table.

I'm thinking of using a left outer join to identify which rows are duplicates, but then how can I go and delete these rows from my temp table?

Am I taking the right approach, or would it be better to remove duplicates after transferring the data?

Upvotes: 2

Views: 2812

Answers (1)

anon
anon

Reputation:

Assuming there is an index on col1/col2, this may perform better than trying to weed out duplicates in the process of the insert:

DELETE t
FROM #temp AS t
WHERE NOT EXISTS
(
  SELECT 1 FROM dbo.table
    WHERE col1 = t.col1
    AND col2 = t.col2
);

An equivalent to Jack's approach would be:

INSERT dbo.table(col1, col2)
SELECT col1, col2
FROM #temp AS t
WHERE NOT EXISTS
(
  SELECT 1 FROM dbo.table
    WHERE col1 = t.col1
    AND col2 = t.col2
);

I'm going to assume you have a unique index on the #temp table so you don't need DISTINCT or GROUP BY. You can also delete duplicates from the #temp table as a separate step first, e.g.

;WITH t AS 
(
  SELECT col1, col2, 
    rn = ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1)
)
DELETE t WHERE rn > 1;

Whether you're doing the right thing or not, no idea. Will require a lot more information and testing to determine which approach is best. You're going to have to scan 2+ million rows for duplicates no matter which method you end up using.

Upvotes: 2

Related Questions