Reputation: 305
I have an issue. I have a table with almost 2 billion rows (yeah, I know...) that has a lot of duplicate data in it, which I'd like to delete.
The columns are first, last, dob, address, city, state, zip, and telephone, and they are in a table called PF_main. Each record does have a unique ID, thankfully, in a column called ID.
How can I dedupe this and leave one unique entry (row) in the pf_main table for each person?
Thank you all in advance for your responses...
Upvotes: 2
Views: 11357
Reputation: 41
IMHO there's no single best way to dedupe, which is why you see so many different solutions; it depends on your situation. My situation: I have a big history file listing the monthly metrics for each of tens of thousands of loan accounts over many years. Each account appears in the file at every month-end while it is active, but once it becomes inactive it never appears again. I want only the latest record for each account. I don't care about the record from when an account was opened 20 years ago; I only care about the latest record, from when it closed 5 years ago. For accounts still active, I want the record for the most recent calendar month. So to me, the "duplicates" are all of an account's monthly records except the last one, and I want to get rid of them.
This may not be your exact problem, but the solution I present may give you a boost toward your own solution.
(I do most of my SQL in SAS PROC SQL, but I think you will get the idea.)
My solution uses a subquery...
/* dedupe: keep only the latest monthly record for each Loan_Number */
proc sql;
  create table &delta. as
  select distinct b.*, sub.mxasof
  from &bravo. b
  join ( /* latest month-end on file for each loan */
         select Loan_Number, max(asof) as mxasof format 6.0
         from &bravo.
         group by Loan_Number
       ) sub
    on  b.Loan_Number = sub.Loan_Number
    and b.asof = sub.mxasof
  order by b.Loan_Number, b.asof desc;
quit;
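For readers who are not on SAS: the same "keep only the latest row per key" idea can be written with a window function in most databases. The sketch below is just an illustration, not part of the original answer; the bravo table name stands in for the &bravo. macro variable above.
-- Keep only each loan's latest month-end record (generic SQL sketch with a window function)
WITH ranked AS (
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY Loan_Number ORDER BY asof DESC) AS rn
    FROM bravo AS b
)
SELECT *
FROM ranked
WHERE rn = 1;   -- rn = 1 is the latest record for each Loan_Number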
Upvotes: 0
Reputation: 1269973
A 2 billion row table is quite big. Let me assume that first, last, and dob constitute a "person". My suggestion is to build an index on the "person" columns and then take the truncate/re-insert approach.
In practice, this looks like:
-- index so the "person" comparison in the NOT EXISTS can be resolved efficiently
create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

-- copy one row per person (the one with the lowest id) into a staging table
select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
                  from pf_main m2
                  where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
                        m2.id < m.id
                 );

-- swap the de-duplicated rows back in
truncate table pf_main;

insert into pf_main
    select *
    from temp_pf_main;
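One practical addition of my own (not part of the original answer): between the SELECT ... INTO and the TRUNCATE, a quick sanity check on the staging table can confirm the dedupe behaved as expected.
-- row counts before vs. after dedupe
select (select count(*) from pf_main)      as original_rows,
       (select count(*) from temp_pf_main) as deduped_rows;

-- any "person" still appearing more than once; this should return zero rows
select first, last, dob, count(*) as cnt
from temp_pf_main
group by first, last, dob
having count(*) > 1;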
Upvotes: 4
Reputation: 10600
SELECT
ID, first, last, dob, address, city, state, zip, telephone,
ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
FROM PF_main
will give you the "number" of each row within its group of identical values (ordered by ID).
so if you have the following records:
id, last, first, dob, address, city, state, zip, telephone
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0064
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
you will get the following results (note last column)
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0064, 1
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 1
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 2
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 3
So you can just delete records with RecordInstance > 1:
WITH Records AS
(
    SELECT
        ID, first, last, dob, address, city, state, zip, telephone,
        ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
    FROM PF_main
)
DELETE FROM Records
WHERE RecordInstance > 1;
Upvotes: 9
Reputation: 20842
Other answers will certainly give you ideas as far as syntax goes.
With 2 billion rows, your concerns may involve other things besides syntax, so I will give you a generic answer that works on many databases. If you cannot afford to do the delete or copy "online" in one session, or you are low on space, then consider the following incremental approach.
Deletes this large can take a long, long time to complete, as in hours or even days, and they also risk failing before completion. In a few cases, the approach that worked best, surprisingly, was a rudimentary, long-running stored procedure that took small batches and committed every few records (few being a relative term here; few might be 100 or 1,000 or 10,000 records). Sure, it doesn't look elegant, but the point is that it is "incremental" and a low resource consumer.
The idea is to identify a partitioning key by which you can address ranges of records (to cut your working set down), or else do an initial query that captures the duplicate keys into another table. Then iterate through those keys one small batch at a time, delete, commit, and repeat. If you do it live without a temp table, make sure to keep the ranges small by adding appropriate criteria to reduce the result sets, and keep the cursor or sort area sizes small.
-- Pseudocode
counter = 0;
for each row in dup table            -- and if this takes long, break this into ranges
    delete from primary_tab where id = @id
    if counter++ > 1000 then
        commit;
        counter = 0;
    end if
end loop
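As a concrete illustration only (my sketch, not part of the original answer), the batching pattern might look roughly like this in T-SQL, assuming the duplicate IDs were first snapshotted into a work table called dup_ids; the table name and batch size are placeholders:
-- 1) Snapshot the IDs of the duplicate rows (keeping the lowest ID per person).
SELECT ID
INTO dup_ids
FROM (
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone
                              ORDER BY ID) AS rn
    FROM PF_main
) x
WHERE rn > 1;

-- 2) Delete in small batches; under autocommit each DELETE commits on its own,
--    so stopping or failing mid-way only loses the current batch.
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) p
    FROM PF_main AS p
    INNER JOIN dup_ids AS d ON d.ID = p.ID;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END;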
This stored proc can be stopped and restarted without worrying about a giant rollback, and it will also run reliably for hours or days without a major impact on database availability. In Oracle the resources to watch would be undo segments and sort area sizes, among other things; in MSSQL I am not an expert. Eventually, it will finish. In the meantime you aren't tying up the target table with locks or large transactions, so DML can continue against the table. The caveat is that if DML does continue, you will potentially have to repeat the snapshot into the dup-ids table to handle any duplicates that have arisen since the snapshot was taken.
Caveat: This doesn't defrag free blocks/rows or coalesce deleted space the way completely rebuilding a new table does, but it can be done online and without allocating a new copy. On the other hand, if you have the freedom to do it "online" in one session and/or have a maintenance window, and the duplicate rows are greater than, say, 15-20% of your data, then you should opt for the "create table as select * from original minus duplicates" approach, as in Gordon's answer, in order to compress the data segment into a densely utilized, contiguous segment and get better cache/IO performance in the long run. Rarely are duplicates more than a fraction of a percent of the space, however.
Reasons for doing this include:
1 - Table too large to create a temporary de-duped copy.
2 - You can't or don't want to drop the original table to do the swap once the new one is ready.
3 - You can't get a maintenance window to do one giant operation.
Otherwise, see Gordon Linoff's answer.
Upvotes: 3
Reputation: 953
Everybody else has offered great methods for doing this on the technical side; I'll just add one pragmatic point.
IMHO it is difficult to completely automate the process of eliminating the duplicates in a large table of persons. If the matching is too relaxed... then legitimate records will be deleted. If the matching is too strict... then duplicates will be left in.
For my clients I built a query much like the ones above that returns rows representing LIKELY duplicates, using Address and Last Name as the matching criteria (see the sketch below). They then eyeball the list of likelies and click "Delete" or "Merge" on the rows they judge to be duplicates.
This might not work in your project (with billions of rows), but it makes sense in an environment where both duplicates and lost data need to be avoided. A single human operator can improve the cleanliness of several thousand rows of data in a few minutes, and they can do it a little at a time over many sessions.
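As an illustration only (my sketch; the column names follow the asker's PF_main), a likely-duplicates report matching on last name and address might look like this:
-- Loose "likely duplicates" report for human review: same last name at the same address
SELECT p.ID, p.first, p.last, p.dob, p.address, p.city, p.state, p.zip, p.telephone
FROM PF_main AS p
JOIN (
    SELECT last, address
    FROM PF_main
    GROUP BY last, address
    HAVING COUNT(*) > 1
) AS likely
  ON  likely.last = p.last
  AND likely.address = p.address
ORDER BY p.last, p.address, p.ID;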
Upvotes: 1