Reputation: 423
I want to remove duplicate rows from a table measurement
in a PostgreSQL 9.1 data base.
Some table information:
select column_name, data_type from information_schema.columns where table_name = 'measurement';
column_name | data_type
-------------+-----------
s_sum | real
s_l3 | real
s_l2 | real
s_l1 | real
q_sum | real
q_l3 | real
q_l2 | real
q_l1 | real
p_sum | real
p_l3 | real
p_l2 | real
p_l1 | real
irms_n | real
irms_l3 | real
irms_l2 | real
irms_l1 | real
urms_l3 | real
urms_l2 | real
urms_l1 | real
timestamp | integer
site | integer
id | integer
(22 rows)
and
select count(*) from measurement;
count
----------
56265678
(1 row)
So what I want to do is to remove duplicate rows where all columns except id
are equal. I went ahead and tried this with the approach in this answer.
SET temp_buffers = '1GB';
BEGIN;
CREATE TEMPORARY TABLE t_tmp AS
SELECT DISTINCT site,
timestamp,
urms_l1,
urms_l2,
urms_l3,
irms_l1,
irms_l2,
irms_l3,
irms_n,
p_l1,
p_l2,
p_l3,
p_sum,
q_l1,
q_l2,
q_l3,
q_sum,
s_l1,
s_l2,
s_l3,
s_sum
FROM measurement;
TRUNCATE measurement;
INSERT INTO measurement
SELECT * FROM t_tmp;
COMMIT;
where the echo / error is:
SET
BEGIN
SELECT 56103537
TRUNCATE TABLE
ERROR: duplicate key value violates unique constraint "measurement_pkey"
DETAIL: Key (id)=(1) already exists.
ROLLBACK
so it looks as if it would remove the duplicates alright (compare with number of rows of original table measurement
above) but then a primary key constraint is violated. I do not really know what is going on here, I assume that the INSERT
is not operating on the truncated table...
Update:
The requested sql schema is as follows:
--
-- PostgreSQL database dump
--
SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;
--
-- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -
--
CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog;
--
-- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -
--
COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';
SET search_path = public, pg_catalog;
SET default_tablespace = '';
SET default_with_oids = false;
--
-- Name: measurement; Type: TABLE; Schema: public; Owner: -; Tablespace:
--
CREATE TABLE measurement (
id integer NOT NULL,
site integer,
"timestamp" integer,
urms_l1 real,
urms_l2 real,
urms_l3 real,
irms_l1 real,
irms_l2 real,
irms_l3 real,
irms_n real,
p_l1 real,
p_l2 real,
p_l3 real,
p_sum real,
q_l1 real,
q_l2 real,
q_l3 real,
q_sum real,
s_l1 real,
s_l2 real,
s_l3 real,
s_sum real
);
--
-- Name: measurement_pkey; Type: CONSTRAINT; Schema: public; Owner: -; Tablespace:
--
ALTER TABLE ONLY measurement
ADD CONSTRAINT measurement_pkey PRIMARY KEY (id);
--
-- Name: public; Type: ACL; Schema: -; Owner: -
--
REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;
--
-- PostgreSQL database dump complete
--
And then
SELECT id
FROM measurement
GROUP BY id
HAVING COUNT(*) > 1;
yields
id
----
(0 rows)
Upvotes: 0
Views: 2468
Reputation: 32180
The primary key is a unique constraint on a subset of the fields in your measurement
table, while your SELECT DISTINCT
returns only unique records from the fields you list, but looks at every field in each record, not just the primary key
That is, you have records which have the same primary key (id
, apparently), but have different values in the non-key fields.
You can find keys that have duplicate ids by running this:
SELECT id
FROM t_tmp
GROUP BY id
HAVING COUNT(*) > 1;
And you can display the records related to that by doing this:
SELECT *
FROM t_tmp
WHERE id IN (
SELECT id
FROM t_tmp
GROUP BY id
HAVING COUNT(*) > 1
);
[Note that I specify t_tmp
above, but if you haven't actually run TRUNCATE TABLE measurement;
yet that you can use measurement
instead.]
Those are the records that have duplicated ids that are causing your key violations, assuming that the key is just on id
, which it looks like it is from the error message. You'll need to decide on which one to keep and which one to delete, or otherwise consider updating the id
field to a new unique value.
It's not clear if id
is tied to a sequence or was created as a SERIAL
or BIGSERIAL
in your new table. You should really just generate a CREATE TABLE
script from pgAdmin to give us the full schema. It's also not clear if there are other unique constraints on the table, either.
Upvotes: 1