Reputation: 21

Finding multiple duplicates in Postgres

We've had an accident where multiple rows with duplicate values have been inserted into a table, and I need to find which rows in a rather specific format. So far, I have this query:

    SELECT p2.id
    FROM assignmentobject p1, assignmentobject p2
    WHERE ST_Equals(p1.the_geom, p2.the_geom) AND
    p1.id <> p2.id and p1.assignmentid = 15548
    group by p1.id, p2.id

which compares the geometries of the rows and spits it out if it's the same. The IDs are primary keys, and are sequentially created.

However, this presents a problem, as this small segment of the result shows:

p1.id   p2.id
35311   35314
35311   35315
35314   35311
35314   35315
35315   35311
35315   35314

As can be seen here, 35311, 35314, and 35315 have the same geometries, and because of this, all of the combinations between them are included in the result. What I'm aiming to achieve, is having the lowest or highest ID used as the "base", and ignore the other combinations that doesn't involve this "base". I.e., the result shown above would be:

p1.id    p2.id
35311    35314
35311    35315

Here, the combinations between 31314 and 35315 are left out. Is this possible to achieve using pure SQL?

Upvotes: 2

Answers (2)

joop

Reputation: 4503

CREATE TABLE pair (
        ll INTEGER NOT NULL
        , rr INTEGER NOT NULL
        , PRIMARY KEY (ll , rr)
        ) ;

INSERT INTO pair (ll,rr) VALUES
(35311,35314) ,(35311,35315)
,(35314,35311) ,(35314,35315)
,(35315,35311) ,(35315,35314)
        ;

SELECT p1.ll AS p1, p1.rr AS p2
FROM pair p1
WHERE p1.ll < p1.rr -- tie breaker
AND NOT EXISTS (
        SELECT * FROM pair nx
        WHERE nx.ll < nx.rr
        AND nx.rr = p1.ll
        )
        ;

The same with the original geo query packed into a CTE:

WITH pair AS (
  SELECT p1.id AS ll
       , p2.id AS rr
  FROM assignmentobject p1
  JOIN assignmentobject p2 ON ST_Equals(p1.the_geom, p2.the_geom)
                          -- not sure if you want this ...
                          AND p1.assignmentid = p2.assignmentid 
  WHERE p1.id <> p2.id and p1.assignmentid = 15548
  -- group by seems to make no sense here
  -- group by p1.id, p2.id
   )                                                      
SELECT pp.ll AS p1, pp.rr AS p2
FROM pair pp
WHERE pp.ll < pp.rr -- tie breaker
AND NOT EXISTS (
        SELECT * FROM pair nx
        WHERE nx.ll < nx.rr
        AND nx.rr = pp.ll
        )
        ;

Upvotes: 1

Clodoaldo Neto

Reputation: 125204

Just change the <> operator to <

WHERE ST_Equals(p1.the_geom, p2.the_geom) AND
p1.id < p2.id and p1.assignmentid = 15548

If assignmentid is duplicated and you want all the duplicates at once

select p2.id
from
    assignmentobject p1
    inner join
    assignmentobject p2 using(assigmentid)
where
    st_equals(p1.the_geom, p2.the_geom) and
    p1.id < p2.id
group by p1.id, p2.id

Upvotes: 1

Finding multiple duplicates in Postgres

Answers (2)

Related Questions