Duplicate records in SQL

Question

Checking duplicate records by the calculating the hashcode of the table row:

I am uploading excel file into mysql database through user interface before actually inserting the records i have to identify the duplicate records against the records already present in the main table (number of records can be 2500000) and i have to do it as fast as possible.

My solution to this is I am adding one more field in the table say Hashcode and at the time of insertion of records i will calculate the hashcode for each row and at the time of checking the duplicate rows I will match only the hashcode field.If the hashcode exists in the main table then the record is duplicate else it is unique.

Is there any other way to do this in a more faster manner as i am unaware of mysql database capabilities

Thanks!

Benvorth · Accepted Answer

a) if you want to stick to the hash-code use something like

 SELECT sha256(CONCAT(col1, "-", col2, "-", col3)) FROM ...

instead of MD5() - I had some collisions here recently. Dont forget to add an index on the new column.

b) For your purpose a simple

 SELECT col1, col2, col3, Count(*) 
 FROM table 
 GROUP BY 1,2,3 HAVING Count(*) > 1

might be what you want. This will give you 100% duplicates as well and you dont have to add a new hash-col and check for it...

Duplicate records in SQL

Answers (1)

Related Questions