Reputation: 607
Any guess why this statement is taking too long when handling 300 000 rows. Basically this query is meant for finding duplicates.
SELECT DISTINCT
a.Id,
b.Id as sid
FROM
csv_temp a
INNER JOIN
csv_temp b ON a.firstname = b.firstname AND
a.lastname = b.lastname AND
((a.address = b.address) OR
(a.zip = b.zip) OR
(a.city = b.city AND a.state = b.state) )
WHERE
a.Id <> b.Id AND
a.status=2 AND
b.status=1 AND
a.flag !=1 AND
b.flag !=1
Upvotes: 0
Views: 65
Reputation: 21513
ORs often seem to have poor performance, and on JOIN conditions I would expect that to be worse. Try having 3 SELECTs (one for each of the ORed conditions) and UNION the results together. Suspect the DISTINCTS are not required either if this is done:-
SELECT
a.Id,
b.Id as sid
FROM
csv_temp a
INNER JOIN
csv_temp b ON a.firstname = b.firstname AND
a.lastname = b.lastname AND
a.address = b.address
WHERE
a.Id <> b.Id AND
a.status=2 AND
b.status=1 AND
a.flag !=1 AND
b.flag !=1
UNION
SELECT
a.Id,
b.Id as sid
FROM
csv_temp a
INNER JOIN
csv_temp b ON a.firstname = b.firstname AND
a.lastname = b.lastname AND
a.zip = b.zip
WHERE
a.Id <> b.Id AND
a.status=2 AND
b.status=1 AND
a.flag !=1 AND
b.flag !=1
UNION
SELECT
a.Id,
b.Id as sid
FROM
csv_temp a
INNER JOIN
csv_temp b ON a.firstname = b.firstname AND
a.lastname = b.lastname AND
a.city = b.city AND a.state = b.state
WHERE
a.Id <> b.Id AND
a.status=2 AND
b.status=1 AND
a.flag !=1 AND
b.flag !=1
Upvotes: 3
Reputation: 3777
Now check using the Explain Plan after adding indexes on the column which are using in compare
Upvotes: 0