Reputation: 31
I'm having a problem where somehow duplicate rows have made it into the database, where a certain column is duplicate. I've done some research and found posts like this one which works except it deletes ALL rows except for one of the duplicate.
My table structure is like so:
id | hints |
208 episode=1&season=1&show=Name1
209 episode=1&season=1&show=Name1
210 episode=1&season=2&show=Name1
211 episode=1&season=2&show=Name1
212 episode=3&season=3&show=Name2
As you can see, row ids 208 and 209 are duplicates of each other and 210 and 211 are duplicates of each other as well. However, 212 exists and is NOT a duplicate to any other rows.
I've been able to identify the duplicates by running the following query:
SELECT id, hints FROM media_items GROUP BY hints HAVING count(*) > 1;
My table has 21097 rows, and the above query returns 2309 duplicates. If I run the following query which I got from the link above, all rows EXCEPT the 2309 duplicates are deleted, which is the opposite of what I want to do.
DELETE from media_items
WHERE rowid NOT IN (
SELECT min(rowid) FROM media_items GROUP BY hints HAVING count(*) > 1
);
I think this solution technically worked for the user that asked it because they didn't need to keep some rows that did not contain duplicates.
I have also tried the following query, which makes sense to me (I'm clearly misunderstanding), but it only deletes 1 row at a time and if I painstakingly run it the amount of times that I have duplicates (2309), it actually goes further and starts deleting non-duplicates.
DELETE FROM 'media_items'
WHERE id = (
SELECT MIN(id) FROM 'media_items' GROUP BY hints HAVING COUNT(*) > 1
);
Is this possible?
Upvotes: 3
Views: 707
Reputation: 1269463
If you want to delete all rows of duplicates, then:
DELETE FROM media_items
WHERE EXISTS (SELECT 1
FROM media_items mi2
WHERE mi2.hint = media_items.hint AND mi2.id <> media_items.id
);
If you wanted to keep one of the rows from the duplicates, say the one with the smallest id, I would recommend a correlated subquery:
DELETE FROM media_items
WHERE id > (SELECT MIN(mi2.id)
FROM media_items mi2
WHERE mi2.hint = media_items.hint
);
I strongly advise you not to use NOT IN
with subqueries. I should work in this case because I doubt id
could ever be NULL
. But a single NULL
value returned by the subquery will result in no rows being deleted -- even for bona fide duplicates.
Because of this counter-intuitive behavior, use direct comparisons or NOT EXISTS
.
Upvotes: 1
Reputation: 520898
Assuming you always want to retain the smallest id
of the duplicates:
DELETE
FROM media_items
WHERE id NOT IN (SELECT MIN(id) FROM media_items GROUP BY hints);
The above subquery finds, for each group of hints (which could just be a single hint), the smallest id
for that group. The delete query then spares that minimum id
from being deleted, while all others would be deleted.
Upvotes: 3