Reputation: 2221
I want to remove duplicates from a large table having about 1million rows and increasing every hour. It has no unique id and has about ~575 columns but sparsely filled.
The table is 'like' a log table where new entries are appended every hour without unique timestamp.
The duplicates are like 1-3% but I want to remove it anyway ;) Any ideas?
I tried ctid column (as here) but its very slow.
Upvotes: 4
Views: 4221
Reputation: 61606
The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.
Example:
CREATE INDEX index_name ON tablename (md5((tablename.*)::text));
This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone
because their cast-to-text value is session-dependent).
Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:
SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
You may also use this index to avoid duplicates rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would be rejected at INSERT or UPDATE time).
Upvotes: 4