Delete duplicate rows but maintain many to many relationship

Question

I'm saving the results from my Tensorflow image classifier in a SQL database. I have 3 tables. Images, Categories and a table joining the two with weight variables. Some images have no relationships and some have lots.

The problem is that I have duplicate rows in the image table that need to be removed. But I need to preserve the many to many relationships, if the duplicated image has one or more.

Here's an example:

Table name: my_images

+----+------------+-----------------+
| ID | image_path | image_filename  |
+----+------------+-----------------+
|  1 | Film 1     | Film 1 001.jpg  |
|  2 | Film 1     | Film 1 001.jpg  |
|  3 | Film 1     | Film 1 002.jpg  |
|  4 | Film 1     | Film 1 002.jpg  |
|  5 | Film 1     | Film 1 003.jpg  |
|  6 | Film 1     | Film 1 003.jpg  |
+----+------------+-----------------+

Table name: my_terms

+---------+------------+
| term_id | term_name  |
+---------+------------+
|       1 | cat        |
|       2 | dog        |
|       3 | automobile |
+---------+------------+

Table name: my_term_relationships

+----------+---------+---------+
| image_id | term_id | weight  |
+----------+---------+---------+
|        2 |       1 | 0.58516 |
|        2 |       3 | 0.16721 |
|        3 |       2 | 0.21475 |
+----------+---------+---------+

So in this example, the ideal outcome would be to delete rows 1, 4 and either row 5 or 6 from my_images.

fancyPants · Accepted Answer

Approach this step by step.

First, find the duplicate entries:

SELECT 
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1

Second, get all rows that are duplicate:

SELECT mi.*
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename

Finally, get the IDs not to delete.

SELECT MIN(ID)
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0

see it working live in an sqlfiddle

Check if everything is correct. If yes, convert it into a delete statement.

DELETE my_images.* FROM my_images 
JOIN (
SELECT MIN(ID) AS ID
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0
) sq USING(ID);

sqlfiddle with delete statement: http://sqlfiddle.com/#!9/ed63de/1/0

EDIT: to also fix the problem that Logar mentioned, use this UPDATE statement before the DELETE statement.

UPDATE my_term_relationships mtr 
JOIN (
    SELECT mi.ID, minID
    FROM my_images mi
    JOIN (
        SELECT 
        image_path, image_filename, MIN(ID) AS minID
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
) sq ON mtr.image_id = sq.ID 
SET mtr.image_id = sq.minID;

see it all working together in this sqlfiddle: http://sqlfiddle.com/#!9/268539/1/0

Delete duplicate rows but maintain many to many relationship

Answers (2)

Related Questions