David Apple
David Apple

Reputation: 101

Delete duplicate rows but maintain many to many relationship

I'm saving the results from my Tensorflow image classifier in a SQL database. I have 3 tables. Images, Categories and a table joining the two with weight variables. Some images have no relationships and some have lots.

The problem is that I have duplicate rows in the image table that need to be removed. But I need to preserve the many to many relationships, if the duplicated image has one or more.

Here's an example:

Table name: my_images

+----+------------+-----------------+
| ID | image_path | image_filename  |
+----+------------+-----------------+
|  1 | Film 1     | Film 1 001.jpg  |
|  2 | Film 1     | Film 1 001.jpg  |
|  3 | Film 1     | Film 1 002.jpg  |
|  4 | Film 1     | Film 1 002.jpg  |
|  5 | Film 1     | Film 1 003.jpg  |
|  6 | Film 1     | Film 1 003.jpg  |
+----+------------+-----------------+

Table name: my_terms

+---------+------------+
| term_id | term_name  |
+---------+------------+
|       1 | cat        |
|       2 | dog        |
|       3 | automobile |
+---------+------------+

Table name: my_term_relationships

+----------+---------+---------+
| image_id | term_id | weight  |
+----------+---------+---------+
|        2 |       1 | 0.58516 |
|        2 |       3 | 0.16721 |
|        3 |       2 | 0.21475 |
+----------+---------+---------+

So in this example, the ideal outcome would be to delete rows 1, 4 and either row 5 or 6 from my_images.

Upvotes: 3

Views: 1339

Answers (2)

fancyPants
fancyPants

Reputation: 51938

Approach this step by step.

First, find the duplicate entries:

SELECT 
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1

Second, get all rows that are duplicate:

SELECT mi.*
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename

Finally, get the IDs not to delete.

SELECT MIN(ID)
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0

Check if everything is correct. If yes, convert it into a delete statement.

DELETE my_images.* FROM my_images 
JOIN (
SELECT MIN(ID) AS ID
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0
) sq USING(ID);

EDIT: to also fix the problem that Logar mentioned, use this UPDATE statement before the DELETE statement.

UPDATE my_term_relationships mtr 
JOIN (
    SELECT mi.ID, minID
    FROM my_images mi
    JOIN (
        SELECT 
        image_path, image_filename, MIN(ID) AS minID
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
) sq ON mtr.image_id = sq.ID 
SET mtr.image_id = sq.minID;

Upvotes: 1

Laposhasú Acsa
Laposhasú Acsa

Reputation: 1580

You need to query two set of image ID's, and make a filter with them. Assuming image_path and image_filename is UNIQUE together:

  1. All my_images ID, that is not referenced by my_term_relationships, but a corresponding image_path+image_filename might be referenced.
  2. Unique ID's, which belongs to image_path+image_filename pairs, that is not referenced at all in my_term_relationships.

Take a look at this query:

DELETE FROM my_images 
WHERE 
   ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
   AND 
   ID NOT IN (SELECT id FROM (
    SELECT MIN(ID) as id
    FROM my_images 
    LEFT JOIN my_term_relationships ON ID = image_id
     GROUP BY image_path,image_filename
     HAVING COUNT(image_id) = 0
   ) as u_ids -- 2
);

Note, that you must wrap the my_images table in the DELETE's WHERE clause in a sub-query. Read this thread for explanation: Can't specify target table for update in FROM clause

Example: sqlfiddle


Example update query for removing duplicate rows from my_term_relationships:

UPDATE my_term_relationships
SET image_id = (
  select min(my_images.ID) 
  from my_images 
  join my_images as ref_image on (my_images.image_path = ref_image.image_path and my_images.image_filename = ref_image.image_filename)
  where ref_image.ID = image_id
 );

Upvotes: 0

Related Questions