Remove partial duplicates from table with no unique column

Question

I have to clean up records from a table that doesn't have a primary key or a a unique constraint.

Table definition:

create table person(
    name text,
    staff_id integer,
    work_code text,
    location
);

Unsurprisingly, it contains a lot of duplicates and partial duplicates. What is the best way to transform the records to a unique set. I don't have to care about other columns besides name and staff_id

Erwin Brandstetter · Accepted Answer

As you

don't have to care about other columns besides name and staff_id

This could be your procedure to clean up the table:

1.) Create a temporary table of unique rows:

CREATE TEMP TABLE p_tmp AS
SELECT DISTINCT ON (name, staff_id)
       name, staff_id, work_code, location
FROM   person
ORDER  BY name, staff_id, work_code, location;

I arbitrarily pick the "first row per (name, staff_id) - minimum work_code and matching location.

2.) Empty table:

TRUNCATE person;

3.) Re-INSERT unique tuples:

INSERT INTO person SELECT * FROM p_tmp;

Make sure, dupes don't creep back in. Add a surrogate primary key:

ALTER TABLE person ADD COLUMN person_id serial PRIMARY KEY;
ALTER TABLE person ADD UNIQUE (name, staff_id);

Or just add a multi-column primary key:

ALTER TABLE person ADD PRIMARY KEY (name, staff_id);

The temporary table will be dropped at the end of the session automatically.

Of course, all of this is best done inside one transaction, so you don't lose anything in the unlikely case that you run into a problem half way. Some clients do that automatically for a batch of SQL statements executed at once.

Remove partial duplicates from table with no unique column

Answers (2)

Related Questions