Reputation: 305
I have an issue. I have a table with almost 2 billion rows (yeah, I know...) that has a lot of duplicate data in it, which I'd like to delete.
The columns are first, last, dob, address, city, state, zip, and telephone, and they are in a table called PF_main. Each record does have a unique ID, thankfully, in a column called ID.
How can I dedupe this and leave one unique entry (row) in the pf_main table for each person?
Thank you all in advance for your responses...
Upvotes: 2
Views: 11357
Reputation: 41
IMHO there's no single best way to dedupe, which is why you see so many different solutions; it depends on your situation. My situation: I have a big history file listing the monthly metrics for each of tens of thousands of loan accounts over many years. Each account appears in the file at every month-end while it is active, but once it becomes inactive it never appears again. I want only the latest record for each account. I don't care about the record from when an account was opened 20 years ago; I only care about the latest record, from when it closed 5 years ago. For accounts still active, I want the record for the most recent calendar month. So to me, the "duplicates" are all of an account's monthly records except the last one, and I want to get rid of them.
This may not be your exact problem, but the solution I present may give you a boost toward your own solution.
(I do most of my SQL in SAS PROC SQL, but I think you will get the idea.)
My solution uses a subquery...
/* dedupe: keep only the latest monthly record for each Loan_Number */
proc sql;
  create table &delta. as
  select distinct b.*, sub.mxasof
  from &bravo. b
  join ( /* latest month-end on file for each loan */
         select Loan_Number, max(asof) as mxasof format 6.0
         from &bravo.
         group by Loan_Number
       ) sub
    on  b.Loan_Number = sub.Loan_Number
    and b.asof = sub.mxasof
  order by b.Loan_Number, b.asof desc;
quit;
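For readers who are not on SAS: the same "keep only the latest row per key" idea can be written with a window function in most databases. The sketch below is just an illustration, not part of the original answer; the bravo table name stands in for the &bravo. macro variable above.
-- Keep only each loan's latest month-end record (generic SQL sketch with a window function)
WITH ranked AS (
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY Loan_Number ORDER BY asof DESC) AS rn
    FROM bravo AS b
)
SELECT *
FROM ranked
WHERE rn = 1;   -- rn = 1 is the latest record for each Loan_Number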
Upvotes: 0
Reputation: 1269973
A 2 billion row table is quite big. Let me assume that first, last, and dob constitute a "person". My suggestion is to build an index on the "person" columns and then take the truncate/re-insert approach.
In practice, this looks like:
-- index so the "person" comparison in the NOT EXISTS can be resolved efficiently
create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

-- copy one row per person (the one with the lowest id) into a staging table
select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
                  from pf_main m2
                  where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
                        m2.id < m.id
                 );

-- swap the de-duplicated rows back in
truncate table pf_main;

insert into pf_main
    select *
    from temp_pf_main;
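One practical addition of my own (not part of the original answer): between the SELECT ... INTO and the TRUNCATE, a quick sanity check on the staging table can confirm the dedupe behaved as expected.
-- row counts before vs. after dedupe
select (select count(*) from pf_main)      as original_rows,
       (select count(*) from temp_pf_main) as deduped_rows;

-- any "person" still appearing more than once; this should return zero rows
select first, last, dob, count(*) as cnt
from temp_pf_main
group by first, last, dob
having count(*) > 1;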
Upvotes: 4
Reputation: 10600
SELECT
ID, first, last, dob, address, city, state, zip, telephone,
ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
FROM PF_main
will give you the "number" of each row within its group of identical values (ordered by ID).
so if you have the following records:
id, last, first, dob, address, city, state, zip, telephone
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0064
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
you will get the following results (note last column)
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0064, 1
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 1
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 2
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 3
So you can just delete records with RecordInstance > 1:
WITH Records AS
(
    SELECT
        ID, first, last, dob, address, city, state, zip, telephone,
        ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
    FROM PF_main
)
DELETE FROM Records
WHERE RecordInstance > 1;
Upvotes: 9
Reputation: 20842
Other answers will certainly give you ideas as far as syntax goes.
With 2 billion rows, your concerns may involve other things besides syntax, so I will give you a generic answer that works on many databases. If you cannot afford to do the delete or copy "online" in one session, or you are low on space, then consider the following incremental approach.
Deletes this large can take a long, long time to complete, as in hours or even days, and they also risk failing before completion. In a few cases, the approach that worked best, surprisingly, was a rudimentary, long-running stored procedure that took small batches and committed every few records (few being a relative term here; few might be 100 or 1,000 or 10,000 records). Sure, it doesn't look elegant, but the point is that it is "incremental" and a low resource consumer.
The idea is to identify a partitioning key by which you can address ranges of records (to cut your working set down), or else do an initial query that captures the duplicate keys into another table. Then iterate through those keys one small batch at a time, delete, commit, and repeat. If you do it live without a temp table, make sure to keep the ranges small by adding appropriate criteria to reduce the result sets, and keep the cursor or sort area sizes small.
-- Pseudocode
counter = 0;
for each row in dup table            -- and if this takes long, break this into ranges
    delete from primary_tab where id = @id
    if counter++ > 1000 then
        commit;
        counter = 0;
    end if
end loop
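As a concrete illustration only (my sketch, not part of the original answer), the batching pattern might look roughly like this in T-SQL, assuming the duplicate IDs were first snapshotted into a work table called dup_ids; the table name and batch size are placeholders:
-- 1) Snapshot the IDs of the duplicate rows (keeping the lowest ID per person).
SELECT ID
INTO dup_ids
FROM (
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone
                              ORDER BY ID) AS rn
    FROM PF_main
) x
WHERE rn > 1;

-- 2) Delete in small batches; under autocommit each DELETE commits on its own,
--    so stopping or failing mid-way only loses the current batch.
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) p
    FROM PF_main AS p
    INNER JOIN dup_ids AS d ON d.ID = p.ID;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END;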
This stored proc can be stopped and restarted without worrying about a giant rollback, and it will also run reliably for hours or days without a major impact on database availability. In Oracle the resources to watch would be undo segments and sort area sizes, among other things; in MSSQL I am not an expert. Eventually, it will finish. In the meantime you aren't tying up the target table with locks or large transactions, so DML can continue against the table. The caveat is that if DML does continue, you will potentially have to repeat the snapshot into the dup-ids table to handle any duplicates that have arisen since the snapshot was taken.
Caveat: This doesn't defrag free blocks/rows or coalesce deleted space the way completely rebuilding a new table does, but it can be done online and without allocating a new copy. On the other hand, if you have the freedom to do it "online" in one session and/or have a maintenance window, and the duplicate rows are greater than, say, 15-20% of your data, then you should opt for the "create table as select * from original minus duplicates" approach, as in Gordon's answer, in order to compress the data segment into a densely utilized, contiguous segment and get better cache/IO performance in the long run. Rarely are duplicates more than a fraction of a percent of the space, however.
Reasons for doing this include:
1 - Table too large to create a temporary de-duped copy.
2 - You can't or don't want to drop the original table to do the swap once the new one is ready.
3 - You can't get a maintenance window to do one giant operation.
Otherwise, see Gordon Linoff's answer.
Upvotes: 3
Reputation: 953
Everybody else has offered great methods for doing this on the technical side; I'll just add one pragmatic point.
IMHO it is difficult to completely automate the process of eliminating the duplicates in a large table of persons. If the matching is too relaxed... then legitimate records will be deleted. If the matching is too strict... then duplicates will be left in.
For my clients I built a query much like the ones above that returns rows representing LIKELY duplicates, using Address and Last Name as the matching criteria (see the sketch below). They then eyeball the list of likelies and click "Delete" or "Merge" on the rows they judge to be duplicates.
This might not work in your project (with billions of rows), but it makes sense in an environment where both duplicates and lost data need to be avoided. A single human operator can improve the cleanliness of several thousand rows of data in a few minutes, and they can do it a little at a time over many sessions.
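As an illustration only (my sketch; the column names follow the asker's PF_main), a likely-duplicates report matching on last name and address might look like this:
-- Loose "likely duplicates" report for human review: same last name at the same address
SELECT p.ID, p.first, p.last, p.dob, p.address, p.city, p.state, p.zip, p.telephone
FROM PF_main AS p
JOIN (
    SELECT last, address
    FROM PF_main
    GROUP BY last, address
    HAVING COUNT(*) > 1
) AS likely
  ON  likely.last = p.last
  AND likely.address = p.address
ORDER BY p.last, p.address, p.ID;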
Upvotes: 1