TIMEX
TIMEX

Reputation: 272094

How to find duplicates in MySQL

Suppose I have many columns. If 2 columns match and are exactly the same, then they are duplicates.

ID | title | link | size | author

Suppose if link and size are similar for 2 rows or more, then those rows are duplicates. How do I get those duplicates into a list and process them?

Upvotes: 2

Views: 1245

Answers (4)

unutbu
unutbu

Reputation: 880299

After you remove the duplicates from the MySQL table, you can add a unique index to the table so no more duplicates can be inserted:

create unique index theTable_index on theTable (link,size);

Upvotes: 1

Lance Rushing
Lance Rushing

Reputation: 7640

Will return all records that have dups:

SELECT theTable.*
FROM theTable
INNER JOIN (
  SELECT link, size
  FROM theTable 
  GROUP BY link, size
  HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size

I like the subquery b/c I can do things like select all but the first or last. (very easy to turn into a delete query then).

Example: select all duplicate records EXCEPT the one with the max ID:

SELECT theTable.*
FROM theTable
INNER JOIN (
  SELECT link, size, max(ID) as maxID
  FROM theTable 
  GROUP BY link, size
  HAVING count(ID) > 1
) dups ON theTable.link = dups.link 
          AND theTable.size = dups.size 
          AND theTable.ID <> dups.maxID

Upvotes: 7

Juha Syrj&#228;l&#228;
Juha Syrj&#228;l&#228;

Reputation: 34281

Assuming that none of id, link or size can be NULL, and id field is the primary key. This gives you the id's of duplicate rows. Beware that same id can be in the results several times, if there are three or more rows with identical link and size values.

select a.id, b.id 
from tbl a, tbl b  
where a.id < b.id   
  and a.link = b.link  
  and a.size = b.size   

Upvotes: 1

Alex Martelli
Alex Martelli

Reputation: 882133

If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, simplest is to build an iterator on a 'SELECT * FROM thetable ORDER BY link, size, and process withitertools.groupbyusing, as key, theoperator.itemgetter` for those two fields; this will present natural groupings of each bunch of 1+ rows with identical values for the fields in question.

I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!

Upvotes: 0

Related Questions