Reputation: 34
I have table text_table
in PostgreSQL 12 database with only 2 columns: text1
and text2
.
There is an idx_text1
index on the text1
column.
For example:
text1 text2
----------------------
1 str1 one
2 str2 two
3 str3 three
4 str2 two
5 str1 seven
6 str4 seven
7 str3 three
How do I remove all duplicate rows (with exact match of values in both columns)?
In the example above, rows 4 and 7 (or 2 and 3) need to be deleted from the table.
I would like to find the fastest way with an SQL query.
Upvotes: 1
Views: 1200
Reputation: 656371
The best performance depends on the size of the table, concurrent activity, the number and avg. size of rows, and most importantly the number of rows to delete.
For only few duplicates, no NULL values (or you don't consider those duplicates), and no PRIMARY KEY
, this would do just fine:
DELETE FROM text_table t
WHERE EXISTS (
SELECT FROM text_table
WHERE (text1, text2) = (t.text1, t.text2)
AND ctid < t.ctid -- exclude self
-- AND pk < t.pk -- alternative with PK
);
Among duplicates, this keeps the row with the smallest ctid
(the first one, physically). (Or the one with the smallest PK value with the alt. syntax.)
About ctid
:
Your index on (text1)
should help. An index on (text1, text2)
would typically help some more, unless strings are very long. (OTOH, when deleting a substantial percentage of all rows, indexes may generate more cost than benefit.)
And consider adding a UNIQUE
index or constraint after removal of duplicates to defend against re-introducing more of those.
Related:
Upvotes: 1
Reputation: 222422
Without a primary key, such task is tedious. I think the simplest approach may be to backup/restore, like so:
create table tmptable as select distinct text1, text2 from mytable;
truncate table mytable; -- back it up first!!
insert into mytable select * from tmptable;
drop table tmptable;
Upvotes: 1