phaebz
phaebz

Reputation: 423

Remove duplicate rows ERROR: duplicate key value

I want to remove duplicate rows from a table measurement in a PostgreSQL 9.1 data base.

Some table information:

select column_name, data_type from information_schema.columns where table_name = 'measurement';

 column_name | data_type 
-------------+-----------
 s_sum       | real
 s_l3        | real
 s_l2        | real
 s_l1        | real
 q_sum       | real
 q_l3        | real
 q_l2        | real
 q_l1        | real
 p_sum       | real
 p_l3        | real
 p_l2        | real
 p_l1        | real
 irms_n      | real
 irms_l3     | real
 irms_l2     | real
 irms_l1     | real
 urms_l3     | real
 urms_l2     | real
 urms_l1     | real
 timestamp   | integer
 site        | integer
 id          | integer
(22 rows)

and

select count(*) from measurement;

  count   
----------
 56265678
(1 row)

So what I want to do is to remove duplicate rows where all columns except id are equal. I went ahead and tried this with the approach in this answer.

SET temp_buffers = '1GB';

BEGIN;

CREATE TEMPORARY TABLE t_tmp AS
SELECT DISTINCT site,
            timestamp,
            urms_l1,
            urms_l2,
            urms_l3,
            irms_l1,
            irms_l2,
            irms_l3,
            irms_n,
            p_l1,
            p_l2,
            p_l3,
            p_sum,
            q_l1,
            q_l2,
            q_l3,
            q_sum,
            s_l1,
            s_l2,
            s_l3,
            s_sum
FROM measurement;

TRUNCATE measurement;

INSERT INTO measurement 
SELECT * FROM t_tmp;

COMMIT;

where the echo / error is:

SET
BEGIN
SELECT 56103537
TRUNCATE TABLE
ERROR:  duplicate key value violates unique constraint "measurement_pkey"
DETAIL:  Key (id)=(1) already exists.
ROLLBACK

so it looks as if it would remove the duplicates alright (compare with number of rows of original table measurement above) but then a primary key constraint is violated. I do not really know what is going on here, I assume that the INSERT is not operating on the truncated table...

Update:

The requested sql schema is as follows:

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;

--
-- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -
--

CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog;


--
-- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -
--

COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';


SET search_path = public, pg_catalog;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: measurement; Type: TABLE; Schema: public; Owner: -; Tablespace: 
--

CREATE TABLE measurement (
    id integer NOT NULL,
    site integer,
    "timestamp" integer,
    urms_l1 real,
    urms_l2 real,
    urms_l3 real,
    irms_l1 real,
    irms_l2 real,
    irms_l3 real,
    irms_n real,
    p_l1 real,
    p_l2 real,
    p_l3 real,
    p_sum real,
    q_l1 real,
    q_l2 real,
    q_l3 real,
    q_sum real,
    s_l1 real,
    s_l2 real,
    s_l3 real,
    s_sum real
);


--
-- Name: measurement_pkey; Type: CONSTRAINT; Schema: public; Owner: -; Tablespace: 
--

ALTER TABLE ONLY measurement
    ADD CONSTRAINT measurement_pkey PRIMARY KEY (id);


--
-- Name: public; Type: ACL; Schema: -; Owner: -
--

REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;


--
-- PostgreSQL database dump complete
--

And then

SELECT id
FROM measurement
GROUP BY id
HAVING COUNT(*) > 1;

yields

 id 
----
(0 rows)

Upvotes: 0

Views: 2468

Answers (1)

Bacon Bits
Bacon Bits

Reputation: 32180

The primary key is a unique constraint on a subset of the fields in your measurement table, while your SELECT DISTINCT returns only unique records from the fields you list, but looks at every field in each record, not just the primary key

That is, you have records which have the same primary key (id, apparently), but have different values in the non-key fields.

You can find keys that have duplicate ids by running this:

SELECT id
FROM t_tmp
GROUP BY id
HAVING COUNT(*) > 1;

And you can display the records related to that by doing this:

SELECT *
FROM t_tmp
WHERE id IN (
    SELECT id
    FROM t_tmp
    GROUP BY id
    HAVING COUNT(*) > 1
);

[Note that I specify t_tmp above, but if you haven't actually run TRUNCATE TABLE measurement; yet that you can use measurement instead.]

Those are the records that have duplicated ids that are causing your key violations, assuming that the key is just on id, which it looks like it is from the error message. You'll need to decide on which one to keep and which one to delete, or otherwise consider updating the id field to a new unique value.

It's not clear if id is tied to a sequence or was created as a SERIAL or BIGSERIAL in your new table. You should really just generate a CREATE TABLE script from pgAdmin to give us the full schema. It's also not clear if there are other unique constraints on the table, either.

Upvotes: 1

Related Questions