Reputation: 2065
Let's say I have the following table called Email
, where Id
is the primary key:
+----+------------------+
| Id | Email |
+----+------------------+
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
+----+------------------+
I'm trying to delete all occurrences of duplicates except the first. So in this case the desired output would be
+----+------------------+
| Id | Email |
+----+------------------+
| 1 | [email protected] |
| 2 | [email protected] |
+----+------------------+
After asking a friend, I found this solution works:
DELETE t1 FROM Person t1 INNER JOIN Person t2
Where t1.Email=t2.Email and t1.Id > t2.Id
My question is why does this work? In particular, when t1
inner joins t2
on Email
field, how does the program know which row of [email protected]
should be matched with which, since there are multiple occurrences of this value with different Ids?
Upvotes: 1
Views: 1075
Reputation: 1271003
First, this is more commonly written using aggregation:
DELETE p
FROM Person p INNER JOIN
(SELECT p2.email, MIN(p2.id) as min_id
FROM Person p2
GROUP BY p2.email
) p2
ON p.email = p2.email and p.id > p2.min_id;
Why does your version work? Well, it works because of the fact that a join
not only matches data but also filters data.
So, the condition
t1.Email = t2.Email and t1.Id > t2.Id
Says that for each record in t1
find matching records in t2 where t1.id > t2.id
. That is, find records in t1
that have a matching record with a smaller id
.
All records have this property -- except for one for each email. That would be the record with the smallest id.
I do not recommend this method for identifying the smallest record, because the join
multiplies the number of records. If one email has five records, then there are up to four matches for one of the records. MySQL needs to figure out what to do when you say to delete a single record four times. (It does the right thing, of course, but there is extra work.)
The aggregation method doesn't have any issues like this.
Upvotes: 1
Reputation: 65408
Consider this select statement only filtering by equality among email columns
SELECT t1.*, t2.*
FROM Person t1
INNER JOIN Person t2
WHERE t1.Email=t2.Email
ORDER BY t1.Id, t2.Id;
returns (1,1), (1,3), (3,1), (3,3)
for t1.id and t2.id values respectively for the mail [email protected]
, and only (2,2)
for [email protected]
. Then If you consider the other filter AND t1.Id > t2.Id
,
SELECT t1.*, t2.*
FROM Person t1
INNER JOIN Person t2
WHERE t1.Email=t2.Email
AND t1.id > t2.id
ORDER BY t1.Id, t2.Id;
then you'll only have one tuple (3,1)
since t1.id > t2.id
is satisfied only for this case of id tuples. If you convert SELECT t1.*, t2.*
to DELETE t1
(of course remove ORDER BY
part also), then obviously you'll delete id = 3
and left rows with id values 1
and 2
, reversely if you replace SELECT t1.*, t2.*
with DELETE t2
, then you'll have rows with id values 2
and 3
.
Upvotes: 2
Reputation: 49410
You compare two identical tables and check all occurrences where emailaddress of both tables are identical.
if the id is the same, the row is ignored.
If the id is different and it must have an id that is bigger than the id of first occurrence , this row gets deleted.
Upvotes: 0