Dennis Hunink
Dennis Hunink

Reputation: 583

Remove duplicates from MySQL DB

I've got a database with over 7000 records. As it turns out, there are several duplicates within those records. I found several suggestions on how to delete duplicates and keep only 1 record. But in my case things are a bit more complicated: cases are not simply duplicates if they hold the same data as another record. Instead, several cases ar perfectly okay holding the same data. They are marked as duplicate only when they hold the same data AND are both inserted within 30 seconds.

Therefore I need a SQL statement that deletes duplicates (eg: all fields, except id and datetime) if they have been inserted within a 40 seconds range (eg: evaluating the datetime field).

Since I'm everything but a SQL expert and can't find a suitable solution online, I truly hope some of you might help me out and point me in the right direction. That would be very appreciated!

The table structure is as following:

CREATE TABLE IF NOT EXISTS `wp_ttr_results` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `schoolyear` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
  `datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `area` varchar(15) CHARACTER SET utf8 NOT NULL,
  `content` varchar(10) CHARACTER SET utf8 NOT NULL,
  `types` varchar(100) CHARACTER SET utf8 NOT NULL,
  `tasksWrong` varchar(300) DEFAULT NULL,
  `tasksRight` varchar(300) DEFAULT NULL,
  `tasksData` longtext CHARACTER SET utf8,
  `parent_id` varchar(20) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=68696 ;

So just to clarify again, a duplicate case is a case that:

[1]holds the same data as another case for all fields, except the id and datetime field

[2]is inserted in the DB, according to the datetime field, within 40 seconds of another record with the same values

If both conditions are met, all cases except one, should be deleted.

Upvotes: 0

Views: 152

Answers (2)

Eugen Rieck
Eugen Rieck

Reputation: 65264

As @Juru pointed out in the comments, we need quite a surgical knive to cut this one. It is however possible to do this in an iterative way via a stored procedure.

First we use a self-join to identify the first duplicate for every record, that itself is not a duplicate:

SELECT DISTINCT
  MIN(postdups.id AS id)
FROM wp_ttr_results AS base
INNER JOIN wp_ttr_results AS postdups
  ON base.id<postdups.id
  AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
  AND base.user_id=postdups.user_id
  AND base.schoolyear=postdups.schoolyear
  AND base.area=postdups.area
  AND base.content=postdups.content
  AND base.types=postdups.types
  AND base.tasksWrong=postdups.tasksWrong
  AND base.tasksRight=postdups.tasksRight
  AND base.parent_id=postdups.user_id
LEFT JOIN wp_ttr_results AS predups
  ON base.id>predups.id
  AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
  AND base.user_id=predups.user_id
  AND base.schoolyear=predups.schoolyear
  AND base.area=predups.area
  AND base.content=predups.content
  AND base.types=predups.types
  AND base.tasksWrong=predups.tasksWrong
  AND base.tasksRight=predups.tasksRight
  AND base.parent_id=predups.user_id
WHERE predups.id IS NULL
GROUP BY base.id
;

This selects the lowest id of all later records (base.id<postdups.id), that have the same payload as an existing record and are within a 40s window (UNIX_TIMESTAMP(dups.datetime)-UNIX_TIMESTAMP(base.datetime)<40), but skips those base records, that are duplicates themselves. In @Juru's example, the :30 record would be hit, as it is a duplicate of the :00 record, which itself is not a duplicate, but the :41 record would not be hit, as it is a duplicate only to :30, which itself is a duplicate of :00.

We have

Now we have to remove this record - since MySQL can't delete from a table it is reading, we must use a variable to achieve that:

CREATE TEMPORARY TABLE cleanUpDuplicatesTemp SELECT DISTINCT 
  -- as above
;
DELETE FROM wp_ttr_results 
WHERE id IN
  (SELECT id FROM cleanUpDuplicatesTemp)
;
DROP TABLE cleanUpDuplicatesTemp
;

Until now we will have removed the first duplicate for each record, in the process possibly changing, what would be considered a duplicate ...

Finally we must loop through this process, exiting the loop if the SELECT DISTINCT returns nothing.

Putting it all together into a stored proceedure:

DELIMITER ;;
CREATE PROCEDURE cleanUpDuplicates()
BEGIN
  DECLARE numDuplicates INT;
  iterate: LOOP
    DROP TABLE IF EXISTS cleanUpDuplicatesTemp;
    CREATE TEMPORARY TABLE cleanUpDuplicatesTemp
    SELECT DISTINCT
      MIN(postdups.id AS id)
    FROM wp_ttr_results AS base
    INNER JOIN wp_ttr_results AS postdups
      ON base.id<postdups.id
      AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
      AND base.user_id=postdups.user_id
      AND base.schoolyear=postdups.schoolyear
      AND base.area=postdups.area
      AND base.content=postdups.content
      AND base.types=postdups.types
      AND base.tasksWrong=postdups.tasksWrong
      AND base.tasksRight=postdups.tasksRight
      AND base.parent_id=postdups.user_id
    LEFT JOIN wp_ttr_results AS predups
      ON base.id>predups.id
      AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
      AND base.user_id=predups.user_id
      AND base.schoolyear=predups.schoolyear
      AND base.area=predups.area
      AND base.content=predups.content
      AND base.types=predups.types
      AND base.tasksWrong=predups.tasksWrong
      AND base.tasksRight=predups.tasksRight
      AND base.parent_id=predups.user_id
    WHERE predups.id IS NULL
    GROUP BY base.id;
    SELECT COUNT(*) INTO numDuplicates FROM cleanUpDuplicatesTemp;
    IF numDuplicates<=0 THEN
      LEAVE iterate;
    END IF;
    DELETE FROM wp_ttr_results 
    WHERE id IN
      (SELECT id FROM cleanUpDuplicatesTemp)
  END LOOP iterate;
  DROP TABLE IF EXISTS cleanUpDuplicatesTemp;
END;;
DELIMITER ;

Now a simple CALL cleanUpDuplicates; should do the trick.

Upvotes: 3

Uueerdo
Uueerdo

Reputation: 15941

This might work, but it probably won't be very fast...

DELETE FROM dupes 
USING wp_ttr_results AS dupes 
   INNER JOIN wp_ttr_results AS origs 
      ON dupes.field1 = origs.field1 
      AND dupes.field2 = origs.field2 
      AND ....
      AND AS dupes.id <> origs.id 
      AND dupes.`datetime` BETWEEN orig.`datetime` AND (orig.`datetime` + INTERVAL 40 SECOND)
;

Upvotes: 0

Related Questions