Reputation: 9

Optimizing SQL duplicate search

We are working on a table with roughly 13 million rows. Our goal is to look for duplicates within this table for only one restaurant (~300,000 rows). Our criteria for duplicates are same last name, same first 2 letters of first name, and same phone or email. Each of these are their own column. Our strategy for now is to create two identical, temporary tables for all rows from the restaurant and then to join them on the criteria above and then return id, first name, last name, phone, and email from the first table.

SELECT 
    DISTINCT t1.id, t1.firstname, t1.lastname, t1.phone, t1.email
FROM 
(
    SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email
    FROM loyalty_member_opentable_customer lmoc
    WHERE lmoc.opentable_restaurant_id=2296 
      AND lmoc.lastname NOT LIKE '%Tour%' 
) AS t1
INNER JOIN 
(
    SELECT lmoc2.id, lmoc2.firstname, lmoc2.lastname, lmoc2.phone, lmoc2.email
    FROM loyalty_member_opentable_customer lmoc2
    WHERE lmoc2.opentable_restaurant_id=2296 
      AND lmoc2.lastname NOT LIKE '%Tour%' 
) AS t2 
   ON STRCMP(t1.lastname,t2.lastname)=0 
  AND t1.id!=t2.id 
  AND STRCMP(LEFT(t1.firstname,2),LEFT(t2.firstname,2))=0 
  AND (STRCMP(t1.phone,t2.phone)=0 OR STRCMP(t1.email,t2.email)=0)
ORDER BY t1.lastname, t1.firstname

The issue is that this query takes north of 48 hours to run. Can anybody think of a more efficient way to run this? We need all of the duplicates so that the restaurant could combine them as they see fit.

Upvotes: 0

Answers (3)

ForguesR

Reputation: 3618

You are not creating a temporary table but using subqueries and that will be slow with 13 million rows. Create a single real temporary table with all the data you need (SELECT INTO).

This is what I'd try :

/* Creating a temporary table */
SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email
INTO tempRestaurant
FROM loyalty_member_opentable_customer AS lmoc
WHERE
  lmoc.opentable_restaurant_id=2296 AND
  lmoc.lastname NOT LIKE '%Tour%' 

/* Select duplicates */
SELECT * FROM 
  tempRestaurant AS t1 
INNER JOIN tempRestaurant AS t2 ON 
  STRCMP(t1.lastname,t2.lastname)=0 
  AND t1.id!=t2.id 
WHERE
  STRCMP(LEFT(t1.firstname,2), LEFT(t2.firstname,2))=0 AND
  ( STRCMP(t1.phone,t2.phone)=0 OR STRCMP(t1.email,t2.email)=0 )

Upvotes: 1

Nevil Jose

Reputation: 11

This SQL will help you find the duplicates

SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email
FROM loyalty_member_opentable_customer lmoc
WHERE lmoc.opentable_restaurant_id=2296 
  AND lmoc.lastname NOT LIKE '%Tour%' 
  AND lmoc.lastname BETWEEN 'ha' AND 'i'
GROUP BY lmoc.opentable_restaurant_id, lmoc.id, LEFT(lmoc.firstname,2), lmoc.lastname, lmoc.phone, lmoc.email
HAVING COUNT(*) > 1

If you got a primary key, they you can easily keep the recent one and purge the old ones, with this SQL

DELETE 
        lmoc.primary_id
FROM loyalty_member_opentable_customer lmoc
LEFT JOIN
    (SELECT 
        MAX(lmoc.primary_id) AS id
    FROM loyalty_member_opentable_customer lmoc
    WHERE lmoc.opentable_restaurant_id=2296 
        AND lmoc.lastname NOT LIKE '%Tour%' 
        AND lmoc.lastname BETWEEN 'ha' AND 'i'
    GROUP BY lmoc.opentable_restaurant_id, lmoc.id, LEFT(lmoc.firstname,2), lmoc.lastname, lmoc.phone, lmoc.email
    ) nodup 
    ON adjuster.id = nodup.id
WHERE lmoc.opentable_restaurant_id=2296 
        AND lmoc.lastname NOT LIKE '%Tour%' 
        AND lmoc.lastname BETWEEN 'ha' AND 'i'
        AND nodup.id IS NULL";

Upvotes: 1

fancyPants

Reputation: 51888

Why not simply do

SELECT lmoc.lastname, lmoc.firstname, lmoc.phone, lmoc.email
FROM loyalty_member_opentable_customer lmoc
WHERE lmoc.opentable_restaurant_id=2296 
  AND lmoc.lastname NOT LIKE '%Tour%'
GROUP BY lmoc.lastname, LEFT(lmoc.firstname, 2), lmoc.phone, lmoc.email
HAVING COUNT(*) > 1;

Upvotes: 1

Optimizing SQL duplicate search

Answers (3)

Related Questions