Keith Black
Keith Black

Reputation: 31

SQLite: Remove Duplicates but keep one and non-duplicate fields

I'm having a problem where somehow duplicate rows have made it into the database, where a certain column is duplicate. I've done some research and found posts like this one which works except it deletes ALL rows except for one of the duplicate.

My table structure is like so:

  id    |  hints                        |
  208      episode=1&season=1&show=Name1
  209      episode=1&season=1&show=Name1
  210      episode=1&season=2&show=Name1
  211      episode=1&season=2&show=Name1
  212      episode=3&season=3&show=Name2

As you can see, row ids 208 and 209 are duplicates of each other and 210 and 211 are duplicates of each other as well. However, 212 exists and is NOT a duplicate to any other rows.

I've been able to identify the duplicates by running the following query:

SELECT id, hints FROM media_items GROUP BY hints HAVING count(*) > 1;

My table has 21097 rows, and the above query returns 2309 duplicates. If I run the following query which I got from the link above, all rows EXCEPT the 2309 duplicates are deleted, which is the opposite of what I want to do.

DELETE from media_items
WHERE rowid NOT IN (
    SELECT min(rowid) FROM media_items GROUP BY hints HAVING count(*) > 1
);

I think this solution technically worked for the user that asked it because they didn't need to keep some rows that did not contain duplicates.

I have also tried the following query, which makes sense to me (I'm clearly misunderstanding), but it only deletes 1 row at a time and if I painstakingly run it the amount of times that I have duplicates (2309), it actually goes further and starts deleting non-duplicates.

DELETE FROM 'media_items'
WHERE id = (
    SELECT MIN(id) FROM 'media_items' GROUP BY hints HAVING COUNT(*) > 1
); 

Is this possible?

Upvotes: 3

Views: 707

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269463

If you want to delete all rows of duplicates, then:

DELETE FROM media_items
WHERE EXISTS (SELECT 1
              FROM media_items mi2
              WHERE mi2.hint = media_items.hint AND mi2.id <> media_items.id
             );

If you wanted to keep one of the rows from the duplicates, say the one with the smallest id, I would recommend a correlated subquery:

DELETE FROM media_items
WHERE id > (SELECT MIN(mi2.id)
            FROM media_items mi2
            WHERE mi2.hint = media_items.hint
           );

I strongly advise you not to use NOT IN with subqueries. I should work in this case because I doubt id could ever be NULL. But a single NULL value returned by the subquery will result in no rows being deleted -- even for bona fide duplicates.

Because of this counter-intuitive behavior, use direct comparisons or NOT EXISTS.

Upvotes: 1

Tim Biegeleisen
Tim Biegeleisen

Reputation: 520898

Assuming you always want to retain the smallest id of the duplicates:

DELETE
FROM media_items
WHERE id NOT IN (SELECT MIN(id) FROM media_items GROUP BY hints);

The above subquery finds, for each group of hints (which could just be a single hint), the smallest id for that group. The delete query then spares that minimum id from being deleted, while all others would be deleted.

Upvotes: 3

Related Questions