Reputation: 34

How to remove duplicate rows with exact match in two columns?

I have table text_table in PostgreSQL 12 database with only 2 columns: text1 and text2.
There is an idx_text1 index on the text1 column.

For example:

       text1    text2
----------------------
   1   str1     one
   2   str2     two
   3   str3     three
   4   str2     two
   5   str1     seven
   6   str4     seven
   7   str3     three

How do I remove all duplicate rows (with exact match of values in both columns)?

In the example above, rows 4 and 7 (or 2 and 3) need to be deleted from the table.
I would like to find the fastest way with an SQL query.

Upvotes: 1

Answers (2)

Erwin Brandstetter

Reputation: 656371

The best performance depends on the size of the table, concurrent activity, the number and avg. size of rows, and most importantly the number of rows to delete.

For only few duplicates, no NULL values (or you don't consider those duplicates), and no PRIMARY KEY, this would do just fine:

DELETE FROM text_table t
WHERE  EXISTS (
   SELECT FROM  text_table 
   WHERE (text1, text2) = (t.text1, t.text2)
   AND   ctid < t.ctid  -- exclude self
-- AND   pk < t.pk      -- alternative with PK
   );

Among duplicates, this keeps the row with the smallest ctid (the first one, physically). (Or the one with the smallest PK value with the alt. syntax.)

About ctid:

Your index on (text1) should help. An index on (text1, text2) would typically help some more, unless strings are very long. (OTOH, when deleting a substantial percentage of all rows, indexes may generate more cost than benefit.)

And consider adding a UNIQUE index or constraint after removal of duplicates to defend against re-introducing more of those.

Upvotes: 1

GMB

Reputation: 222422

Without a primary key, such task is tedious. I think the simplest approach may be to backup/restore, like so:

create table tmptable as select distinct text1, text2 from mytable;
truncate table mytable;  -- back it up first!!
insert into mytable select * from tmptable;
drop table tmptable;

Upvotes: 1

How to remove duplicate rows with exact match in two columns?

Answers (2)

Related Questions