Reputation: 101
I'm saving the results from my Tensorflow image classifier in a SQL database. I have 3 tables. Images, Categories and a table joining the two with weight variables. Some images have no relationships and some have lots.
The problem is that I have duplicate rows in the image table that need to be removed. But I need to preserve the many to many relationships, if the duplicated image has one or more.
Here's an example:
Table name: my_images
+----+------------+-----------------+
| ID | image_path | image_filename |
+----+------------+-----------------+
| 1 | Film 1 | Film 1 001.jpg |
| 2 | Film 1 | Film 1 001.jpg |
| 3 | Film 1 | Film 1 002.jpg |
| 4 | Film 1 | Film 1 002.jpg |
| 5 | Film 1 | Film 1 003.jpg |
| 6 | Film 1 | Film 1 003.jpg |
+----+------------+-----------------+
Table name: my_terms
+---------+------------+
| term_id | term_name |
+---------+------------+
| 1 | cat |
| 2 | dog |
| 3 | automobile |
+---------+------------+
Table name: my_term_relationships
+----------+---------+---------+
| image_id | term_id | weight |
+----------+---------+---------+
| 2 | 1 | 0.58516 |
| 2 | 3 | 0.16721 |
| 3 | 2 | 0.21475 |
+----------+---------+---------+
So in this example, the ideal outcome would be to delete rows 1, 4 and either row 5 or 6 from my_images.
Upvotes: 3
Views: 1339
Reputation: 51938
Approach this step by step.
First, find the duplicate entries:
SELECT
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1
Second, get all rows that are duplicate:
SELECT mi.*
FROM my_images mi
JOIN (
SELECT
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
Finally, get the IDs not to delete.
SELECT MIN(ID)
FROM my_images mi
JOIN (
SELECT
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0
Check if everything is correct. If yes, convert it into a delete statement.
DELETE my_images.* FROM my_images
JOIN (
SELECT MIN(ID) AS ID
FROM my_images mi
JOIN (
SELECT
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0
) sq USING(ID);
EDIT: to also fix the problem that Logar mentioned, use this UPDATE statement before the DELETE statement.
UPDATE my_term_relationships mtr
JOIN (
SELECT mi.ID, minID
FROM my_images mi
JOIN (
SELECT
image_path, image_filename, MIN(ID) AS minID
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
) sq ON mtr.image_id = sq.ID
SET mtr.image_id = sq.minID;
Upvotes: 1
Reputation: 1580
You need to query two set of image ID's, and make a filter with them. Assuming image_path
and image_filename
is UNIQUE together:
my_images
ID, that is not referenced by my_term_relationships
, but a corresponding image_path
+image_filename
might be referenced.image_path
+image_filename
pairs, that is not referenced at all in my_term_relationships
.Take a look at this query:
DELETE FROM my_images
WHERE
ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
AND
ID NOT IN (SELECT id FROM (
SELECT MIN(ID) as id
FROM my_images
LEFT JOIN my_term_relationships ON ID = image_id
GROUP BY image_path,image_filename
HAVING COUNT(image_id) = 0
) as u_ids -- 2
);
Note, that you must wrap the my_images
table in the DELETE's WHERE clause in a sub-query. Read this thread for explanation: Can't specify target table for update in FROM clause
Example: sqlfiddle
Example update query for removing duplicate rows from my_term_relationships
:
UPDATE my_term_relationships
SET image_id = (
select min(my_images.ID)
from my_images
join my_images as ref_image on (my_images.image_path = ref_image.image_path and my_images.image_filename = ref_image.image_filename)
where ref_image.ID = image_id
);
Upvotes: 0