Roemer
Roemer

Reputation: 1271

How to de-duplicate records in MySQL?

This is a hard one. A third party has been sending us data from a fourth party. But they have done that in a horrible format and they messed up and duplicated many of the data.

Now the data is all in one table, even though it should have been in much more than one. This has to do with a historical data format.

Now what SHOULD be each record with multiple related records in other tables, is actually put into our database as follows:

Id HistoricalId Field1 Field2 Field3 Field4 FieldX ... 1 327 2 data data data 3 data data data 4 data data 5 data data 6 328 7 data data data (etc etc)

Everything grossly simplified. So you always first have a sort of "header record". Then records with the data. Until there is a new header. Let's call all the records from one header to the next together a "superrecord" (for instance in the example ID 1 t/m 5 form together the first superrecord, the next superrecord stats at Id 6).

Problem is: there are MANY duplicate "superrecords", easily identified by their duplicate HistoricalId in the header record. But they can be anywhere in the database (the records that form the superrecord will be well sorted and not mixed up, but the superrecords are mixed up).

So the puzzle: remove all duplicate superrecords. We are talking 10.000s here if not more.

So, how would you, in MySQL:

  1. Find a Id from a duplicate superrecord (easy)
  2. Find the Id from the next header record (i.e. the following superrecord)
  3. Delete everything between (and including) the first Id and the second Id minus 1
  4. And do this for all duplicate superrecords.

My head starts spinning. It must be possible with just mySQL, but how? I am just not experienced enough. Even though I am not bad at MySQL, here I cannot even see where to start. Or should I program something in php?

Anyone likes a challenge? Thank you in advance!

UPDATE: Solved it thanks to you and two hours of hard work. See solution.

Upvotes: 2

Views: 132

Answers (2)

Roemer
Roemer

Reputation: 1271

I finally solved it. Thanks everyone, you all put me into the right direction.

Three queries are needed:

First mark all duplicate header records by setting HistoricalID to -1

UPDATE 
   t1 INNER JOIN 
     (SELECT MIN(id) AS keep, HistoricalID FROM t1 
      GROUP BY HistoricalID 
      HAVING count(*) > 1 AND HistoricalID > 0) t2 
   ON t1.HistoricalID = t2.HistoricalID 
SET HistoricalID = IF(t1.id=t2.keep, t1.HistoricalID , -1) 
WHERE t1.HistoricalID > 0

Secondly copy HistoricalID from the header record to all other records below it (in the same superrecord). I can undo this later easily if needed.

UPDATE 
  t1 JOIN 
    ( SELECT Id, @s:=IF(HistoricalID='', @s, HistoricalID) HistoricalID FROM 
       (SELECT * FROM t1 ORDER BY Id) r, (SELECT @s:='') t ) t2 
  ON t1.Id = t2.Id 
SET t1.HistoricalID= t2.HistoricalID

Delete all duplicates:

DELETE FROM t1 WHERE HistoricalID = -1

It worked. Couldn't have done it without you!

Upvotes: 0

Darius X.
Darius X.

Reputation: 2937

If you're open to copying to a different table etc., then...

  1. You can figure which records you want to delete. All records where the historical-id exists in some other record with a higher ID

    SELECT id, HISTORICAL_ID FROM tbl t1 WHERE historical_id>0 AND exists (SELECT 1 FROM tbl t2 WHERE T2.hISTORICAL_id=T1.HISTORICAL_ID and T2.ID>T1.ID)

  2. Since each record has an ID, for each record, you could compute the ID of the Header Record. (This is what you mention in your comment). It would be the Max. ID from any "previous" record where historical id is filled in.

    Select ID, HISTORICAL_ID ,(Select MAX(ID) FROM T2 Where T1.ID <T2.ID and T1.HistoricalId<>0) As PARENT_ID From TBL T1

  3. You can then match the PARENT_ID with the first query to get all the IDs you wish to delete

Upvotes: 1

Related Questions