Gryyphyn
Gryyphyn

Reputation: 31

Need to find and list duplicates between multiple CTEs

I need to identify duplicates on a single table where users have entered the name incorrectly (e.g., instead of putting "John" in fname and "Smith" in lname, a record which already exists, they put "Smith" in fname and "John" in lname). I need to identify these duplicates and list the duplicates and their correct record counterpart.

I have set up two CTEs, one to pull just the columns I need and one to pull the same columns that have the bad info to remove. The primary query pulls all the columns and compares the complete table against the CTE table with the duplicates. I can't do a comparison between the two CTEs becaus it will error (more than one row returned by a subquery used as an expression). In its current form the query returns 9813408908970990872314 results (I gave up at exporting 650k). This is a PostgreSQL database, unknown version, but it's 9+ I believe. There are ~320k valid rows on the person table and 2499 from the wrong_order CTE. The key field is person.number but each will have a unique number, even the wrong entries (every row gets one but not all rows have one due to deleted records that conflict with data retention rules).

WITH
  correct_order AS (
    SELECT (p.lname||', '|| p.fname) AS "name",
    p.number AS "num",
    p.birthdate AS "dob"
    FROM person p
    WHERE p.lname IS NOT NULL
    ),
  wrong_order AS (
    SELECT (p.fname||', '|| p.lname) AS "name",
    p.number AS "num",
    p.birthdate AS "dob"
    FROM person p
    WHERE (p.lname||', '|| p.fname) IN (p.fname||', '|| p.lname)
    )
SELECT
  correct_order.name AS "Correct Name",
  correct_order.num AS "Correct Num",
  correct_order.birthdate AS "Correct DOB",
  wrong_order.name AS "Wrong Name",
  wrong_order.num AS "Wrong Num",
  wrong_order.birthdate AS "Wrong DOB"
FROM
  correct_order, wrong_order
WHERE
  correct_order.name IN (SELECT wrong_order.name FROM wrong_order)

I expected to see name/number/dob from both CTEs where both sets of data match between the two. What I got instead was a duplication where every conceivable combination of those six values. I have tried using a join in the main query (correct_order JOIN wrong_order ON correct_order.name = wrong_order.name) but you can't use = on a string and using IN throws "more than one row...". While it may be possible to do in powershell after the fact I'm not strong enough to lift that without some help.

Upvotes: 0

Views: 154

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246788

You have an unintended cross join in there (the FROM correct_order, wrong_order has no join condition). That explains the astronomical row count.

You need to query along these lines:

SELECTS /* columns you need */
FROM person AS correct
   JOIN person AS wrong
      ON (correct.fname, correct.lname) =
         (wrong.lname, wrong.fname)
WHERE correct.lname IS NOT NULL;

I hope I understood your intention correctly.

There is no problem comparing strings with =, that must be a misunderstanding.

Upvotes: 1

Related Questions