UserBSS1

Reputation: 2221

Remove Duplicate rows from a large table - PostgreSQL

I want to remove duplicates from a large table with about 1 million rows, growing every hour. It has no unique id and roughly 575 columns, most of them sparsely filled.

The table is 'like' a log table: new entries are appended every hour, without a unique timestamp.

The duplicates amount to about 1-3%, but I want to remove them anyway ;) Any ideas?

I tried the ctid column (as here), but it's very slow.

Upvotes: 4

Views: 4221

Answers (1)

Daniel Vérité

Reputation: 61606

The basic idea, which generally works well with PostgreSQL, is to create an index on a hash of the whole row, i.e. all columns taken together.

Example:

CREATE INDEX index_name ON tablename (md5((tablename.*)::text));

This works unless some columns don't play well with the immutability requirement for index expressions (mostly timestamp with time zone, whose cast-to-text value depends on session settings).
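
When that caveat applies, a commonly cited workaround (an assumption on my part, not part of this answer) is to hide the row-to-text cast inside a function declared IMMUTABLE. That declaration is a promise to the planner, and it only holds if every session writes with identical DateStyle and TimeZone settings:

-- Hypothetical helper, declared IMMUTABLE on the assumption that
-- all sessions share the same DateStyle/TimeZone settings.
CREATE FUNCTION row_md5(r tablename) RETURNS text
    LANGUAGE sql IMMUTABLE
    AS 'SELECT md5($1::text)';

-- Index name is illustrative.
CREATE INDEX row_md5_idx ON tablename (row_md5(tablename.*));

If you use this variant, the queries below must call row_md5 as well, otherwise the index cannot be used.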

Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:

SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
 ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
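
To actually remove the extra copies, the same join condition can drive a DELETE (a sketch, not part of the original answer; it keeps the row with the lowest ctid in each duplicate group, and should be run as a single statement since ctids can change after a VACUUM):

-- Delete every row for which an identical row with a lower ctid exists.
DELETE FROM tablename t1
USING tablename t2
WHERE md5((t1.*)::text) = md5((t2.*)::text)
  AND t1.ctid > t2.ctid;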

You may also use this index to avoid duplicate rows in the future, rather than de-duplicating them periodically, by making it UNIQUE (duplicate rows would then be rejected at INSERT or UPDATE time).
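
A minimal sketch of that preventive variant (the index name is illustrative):

-- Any INSERT or UPDATE producing a row identical to an existing one
-- will then fail with a unique_violation error.
CREATE UNIQUE INDEX tablename_row_md5_uniq
    ON tablename (md5((tablename.*)::text));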

Upvotes: 4
