How to de-duplicate records in MySQL?

Question

This is a hard one. A third party has been sending us data from a fourth party. But they have done that in a horrible format and they messed up and duplicated many of the data.

Now the data is all in one table, even though it should have been in much more than one. This has to do with a historical data format.

Now what SHOULD be each record with multiple related records in other tables, is actually put into our database as follows:

Id HistoricalId Field1 Field2 Field3 Field4 FieldX ... 1 327 2 data data data 3 data data data 4 data data 5 data data 6 328 7 data data data (etc etc)

Everything grossly simplified. So you always first have a sort of "header record". Then records with the data. Until there is a new header. Let's call all the records from one header to the next together a "superrecord" (for instance in the example ID 1 t/m 5 form together the first superrecord, the next superrecord stats at Id 6).

Problem is: there are MANY duplicate "superrecords", easily identified by their duplicate HistoricalId in the header record. But they can be anywhere in the database (the records that form the superrecord will be well sorted and not mixed up, but the superrecords are mixed up).

So the puzzle: remove all duplicate superrecords. We are talking 10.000s here if not more.

So, how would you, in MySQL:

Find a Id from a duplicate superrecord (easy)
Find the Id from the next header record (i.e. the following superrecord)
Delete everything between (and including) the first Id and the second Id minus 1
And do this for all duplicate superrecords.

My head starts spinning. It must be possible with just mySQL, but how? I am just not experienced enough. Even though I am not bad at MySQL, here I cannot even see where to start. Or should I program something in php?

Anyone likes a challenge? Thank you in advance!

UPDATE: Solved it thanks to you and two hours of hard work. See solution.

Roemer · Accepted Answer

I finally solved it. Thanks everyone, you all put me into the right direction.

Three queries are needed:

First mark all duplicate header records by setting HistoricalID to -1

UPDATE 
   t1 INNER JOIN 
     (SELECT MIN(id) AS keep, HistoricalID FROM t1 
      GROUP BY HistoricalID 
      HAVING count(*) > 1 AND HistoricalID > 0) t2 
   ON t1.HistoricalID = t2.HistoricalID 
SET HistoricalID = IF(t1.id=t2.keep, t1.HistoricalID , -1) 
WHERE t1.HistoricalID > 0

Secondly copy HistoricalID from the header record to all other records below it (in the same superrecord). I can undo this later easily if needed.

UPDATE 
  t1 JOIN 
    ( SELECT Id, @s:=IF(HistoricalID='', @s, HistoricalID) HistoricalID FROM 
       (SELECT * FROM t1 ORDER BY Id) r, (SELECT @s:='') t ) t2 
  ON t1.Id = t2.Id 
SET t1.HistoricalID= t2.HistoricalID

Delete all duplicates:

DELETE FROM t1 WHERE HistoricalID = -1

It worked. Couldn't have done it without you!

How to de-duplicate records in MySQL?

Answers (2)

Related Questions