netllama
netllama

Reputation: 381

Eliminating duplicate records returned form a JOIN SQL query on postgresql

I've got a query that is joining data across 4 tables to provide data based on test results. The query is working fine, except for the fact that its returning two identical records for each row of unique data. If I throw a DISTINCT in front of the primary key column (a.id) of one of the tables in the join, that eliminates all the duplicates. However, I've read (and found) that DISTINCT tends to introduce a performance hit, so I'm hoping to find a better performing solution, if possible. Hopefully I'm just doing something silly with my JOINS that is easily fixed. This is on postgresql-9.0.x, and yes I'm aware that if i upgraded to 9.1.x then I could likely do a 'group by a.id', but for now I'm stuck on 9.0.x.

Here's the query:

SELECT a.id,a.suiteid,a.testname
     ,date_trunc('second',a.last_update) AS last_update
     ,regexp_replace(p.relname,E'tests','','g')
     ,o.osname 
FROM smoketests AS a
   , pg_class AS p
   , smoke AS t
   , osversmap AS o 
WHERE a.osversion=o.osversion
  AND a.suiteid=t.id
  AND a.tableoid=p.oid
  AND ( a.current_status='FAILED' )
  AND ( a.arch='i386' )
  AND ( a.os='Darwin' )
  AND a.last_update>'2012-05-01 04:00:00'
  AND a.last_update<'2012-05-02 14:20:45' 
ORDER BY a.id ;

which produces this output:

    id    | suiteid |     testname     |     last_update     | regexp_replace |   osname   
----------+---------+------------------+---------------------+----------------+------------
 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke          | OSX-10.7.x
 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke          | OSX-10.7.x
 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke          | OSX-10.7.x
 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke          | OSX-10.7.x
 32549820 |  668232 | leitest          | 2012-05-01 08:38:07 | smoke          | OSX-10.7.x
 32549820 |  668232 | leitest          | 2012-05-01 08:38:07 | smoke          | OSX-10.7.x

The problem is visible in the id column, where there are two of each value returned even though a.id is the unique primary key of the smoke table and doesn't really have duplicates. The 'smoke' table has a one to many relationship with the smoketests table, but I'm still rather confused why I'm getting the duplicates of everything.

Upvotes: 1

Views: 2981

Answers (1)

Chris Travers
Chris Travers

Reputation: 26464

The first thing you shoudl do is stop doing implicit cross-joins. These make tracking down this sort of problem much harder. Rewrite your query as:

SELECT a.id,a.suiteid,a.testname
     ,date_trunc('second',a.last_update) AS last_update
     ,regexp_replace(p.relname,E'tests','','g')
     ,o.osname 
FROM smoketests AS a
JOIN pg_class AS p ON a.tableoid=p.oid
JOIN smoke AS t ON a.suiteid=t.id
JOIN osversmap AS o ON a.osversion=o.osversion
WHERE
  AND ( a.current_status='FAILED' )
  AND ( a.arch='i386' )
  AND ( a.os='Darwin' )
  AND a.last_update>'2012-05-01 04:00:00'
  AND a.last_update<'2012-05-02 14:20:45' 
ORDER BY a.id ;

From there you need to find out what is causing the duplicate. Is smoke causing duplicates? Try returning more records. If so try removing the join and replacing it with an IN subquery.

Upvotes: 3

Related Questions