wwl
wwl

Reputation: 2065

MySQL inner join with multiple columns

Let's say I have the following table called Email, where Id is the primary key:

+----+------------------+
| Id | Email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
| 3  | [email protected] |
+----+------------------+

I'm trying to delete all occurrences of duplicates except the first. So in this case the desired output would be

+----+------------------+
| Id | Email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
+----+------------------+

After asking a friend, I found this solution works:

DELETE t1 FROM Person t1 INNER JOIN Person t2
Where t1.Email=t2.Email and t1.Id > t2.Id

My question is why does this work? In particular, when t1 inner joins t2 on Email field, how does the program know which row of [email protected] should be matched with which, since there are multiple occurrences of this value with different Ids?

Upvotes: 1

Views: 1075

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1271003

First, this is more commonly written using aggregation:

DELETE p
    FROM Person p INNER JOIN
         (SELECT p2.email, MIN(p2.id) as min_id
          FROM Person p2
          GROUP BY p2.email
         ) p2
         ON p.email = p2.email and p.id > p2.min_id;

Why does your version work? Well, it works because of the fact that a join not only matches data but also filters data.

So, the condition

t1.Email = t2.Email and t1.Id > t2.Id

Says that for each record in t1 find matching records in t2 where t1.id > t2.id. That is, find records in t1 that have a matching record with a smaller id.

All records have this property -- except for one for each email. That would be the record with the smallest id.

I do not recommend this method for identifying the smallest record, because the join multiplies the number of records. If one email has five records, then there are up to four matches for one of the records. MySQL needs to figure out what to do when you say to delete a single record four times. (It does the right thing, of course, but there is extra work.)

The aggregation method doesn't have any issues like this.

Upvotes: 1

Barbaros Özhan
Barbaros Özhan

Reputation: 65408

Consider this select statement only filtering by equality among email columns

SELECT t1.*, t2.* 
  FROM Person t1 
 INNER JOIN Person t2
 WHERE t1.Email=t2.Email 
 ORDER BY t1.Id, t2.Id;

returns (1,1), (1,3), (3,1), (3,3) for t1.id and t2.id values respectively for the mail [email protected], and only (2,2) for [email protected]. Then If you consider the other filter AND t1.Id > t2.Id,

SELECT t1.*, t2.* 
  FROM Person t1 
 INNER JOIN Person t2
 WHERE t1.Email=t2.Email 
   AND t1.id > t2.id
 ORDER BY t1.Id, t2.Id;

then you'll only have one tuple (3,1) since t1.id > t2.id is satisfied only for this case of id tuples. If you convert SELECT t1.*, t2.* to DELETE t1 (of course remove ORDER BY part also), then obviously you'll delete id = 3 and left rows with id values 1 and 2, reversely if you replace SELECT t1.*, t2.* with DELETE t2, then you'll have rows with id values 2 and 3.

Demo

Upvotes: 2

nbk
nbk

Reputation: 49410

You compare two identical tables and check all occurrences where emailaddress of both tables are identical.

if the id is the same, the row is ignored.

If the id is different and it must have an id that is bigger than the id of first occurrence , this row gets deleted.

Upvotes: 0

Related Questions