Reputation: 583
I've got a database with over 7000 records. As it turns out, there are several duplicates within those records. I found several suggestions on how to delete duplicates and keep only 1 record. But in my case things are a bit more complicated: cases are not simply duplicates if they hold the same data as another record. Instead, several cases ar perfectly okay holding the same data. They are marked as duplicate only when they hold the same data AND are both inserted within 30 seconds.
Therefore I need a SQL statement that deletes duplicates (eg: all fields, except id
and datetime
) if they have been inserted within a 40 seconds range (eg: evaluating the datetime
field).
Since I'm everything but a SQL expert and can't find a suitable solution online, I truly hope some of you might help me out and point me in the right direction. That would be very appreciated!
The table structure is as following:
CREATE TABLE IF NOT EXISTS `wp_ttr_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`schoolyear` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
`datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`area` varchar(15) CHARACTER SET utf8 NOT NULL,
`content` varchar(10) CHARACTER SET utf8 NOT NULL,
`types` varchar(100) CHARACTER SET utf8 NOT NULL,
`tasksWrong` varchar(300) DEFAULT NULL,
`tasksRight` varchar(300) DEFAULT NULL,
`tasksData` longtext CHARACTER SET utf8,
`parent_id` varchar(20) DEFAULT NULL,
UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=68696 ;
So just to clarify again, a duplicate case is a case that:
[1]holds the same data as another case for all fields, except the id
and datetime
field
[2]is inserted in the DB, according to the datetime
field, within 40 seconds of another record with the same values
If both conditions are met, all cases except one, should be deleted.
Upvotes: 0
Views: 152
Reputation: 65264
As @Juru pointed out in the comments, we need quite a surgical knive to cut this one. It is however possible to do this in an iterative way via a stored procedure.
First we use a self-join to identify the first duplicate for every record, that itself is not a duplicate:
SELECT DISTINCT
MIN(postdups.id AS id)
FROM wp_ttr_results AS base
INNER JOIN wp_ttr_results AS postdups
ON base.id<postdups.id
AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
AND base.user_id=postdups.user_id
AND base.schoolyear=postdups.schoolyear
AND base.area=postdups.area
AND base.content=postdups.content
AND base.types=postdups.types
AND base.tasksWrong=postdups.tasksWrong
AND base.tasksRight=postdups.tasksRight
AND base.parent_id=postdups.user_id
LEFT JOIN wp_ttr_results AS predups
ON base.id>predups.id
AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
AND base.user_id=predups.user_id
AND base.schoolyear=predups.schoolyear
AND base.area=predups.area
AND base.content=predups.content
AND base.types=predups.types
AND base.tasksWrong=predups.tasksWrong
AND base.tasksRight=predups.tasksRight
AND base.parent_id=predups.user_id
WHERE predups.id IS NULL
GROUP BY base.id
;
This selects the lowest id
of all later records (base.id<postdups.id
), that have the same payload as an existing record and are within a 40s window (UNIX_TIMESTAMP(dups.datetime)-UNIX_TIMESTAMP(base.datetime)<40), but skips those base records, that are duplicates themselves. In @Juru's example, the :30
record would be hit, as it is a duplicate of the :00
record, which itself is not a duplicate, but the :41
record would not be hit, as it is a duplicate only to :30
, which itself is a duplicate of :00
.
We have
Now we have to remove this record - since MySQL can't delete from a table it is reading, we must use a variable to achieve that:
CREATE TEMPORARY TABLE cleanUpDuplicatesTemp SELECT DISTINCT
-- as above
;
DELETE FROM wp_ttr_results
WHERE id IN
(SELECT id FROM cleanUpDuplicatesTemp)
;
DROP TABLE cleanUpDuplicatesTemp
;
Until now we will have removed the first duplicate for each record, in the process possibly changing, what would be considered a duplicate ...
Finally we must loop through this process, exiting the loop if the SELECT DISTINCT
returns nothing.
Putting it all together into a stored proceedure:
DELIMITER ;;
CREATE PROCEDURE cleanUpDuplicates()
BEGIN
DECLARE numDuplicates INT;
iterate: LOOP
DROP TABLE IF EXISTS cleanUpDuplicatesTemp;
CREATE TEMPORARY TABLE cleanUpDuplicatesTemp
SELECT DISTINCT
MIN(postdups.id AS id)
FROM wp_ttr_results AS base
INNER JOIN wp_ttr_results AS postdups
ON base.id<postdups.id
AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
AND base.user_id=postdups.user_id
AND base.schoolyear=postdups.schoolyear
AND base.area=postdups.area
AND base.content=postdups.content
AND base.types=postdups.types
AND base.tasksWrong=postdups.tasksWrong
AND base.tasksRight=postdups.tasksRight
AND base.parent_id=postdups.user_id
LEFT JOIN wp_ttr_results AS predups
ON base.id>predups.id
AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
AND base.user_id=predups.user_id
AND base.schoolyear=predups.schoolyear
AND base.area=predups.area
AND base.content=predups.content
AND base.types=predups.types
AND base.tasksWrong=predups.tasksWrong
AND base.tasksRight=predups.tasksRight
AND base.parent_id=predups.user_id
WHERE predups.id IS NULL
GROUP BY base.id;
SELECT COUNT(*) INTO numDuplicates FROM cleanUpDuplicatesTemp;
IF numDuplicates<=0 THEN
LEAVE iterate;
END IF;
DELETE FROM wp_ttr_results
WHERE id IN
(SELECT id FROM cleanUpDuplicatesTemp)
END LOOP iterate;
DROP TABLE IF EXISTS cleanUpDuplicatesTemp;
END;;
DELIMITER ;
Now a simple CALL cleanUpDuplicates;
should do the trick.
Upvotes: 3
Reputation: 15941
This might work, but it probably won't be very fast...
DELETE FROM dupes
USING wp_ttr_results AS dupes
INNER JOIN wp_ttr_results AS origs
ON dupes.field1 = origs.field1
AND dupes.field2 = origs.field2
AND ....
AND AS dupes.id <> origs.id
AND dupes.`datetime` BETWEEN orig.`datetime` AND (orig.`datetime` + INTERVAL 40 SECOND)
;
Upvotes: 0