Reputation: 31
I have a table with a barcode column with a unique index. The data has been loaded with additional chars (-xx) at the end of each barcode to prevent duplicates, but there will be lots of duplicates once I strip off the suffix. Here is a sample of the data:
itemnumber barcode
17912 2-14
18082 2-1
21870 2-10
29219 2-8
Then I created two temporary tables, marty and manny, both with the itemnumber and the stripped down barcodes. So,both tables would contain
itemnumber barcode
17912 2
18082 2
21870 2
29219 2
etc
And the I tried to delete all but the first entry with barcode '2' in the marty table(and every other barcode). I hoped then to update the original table with the correct first entry and the users could fix up the duplicates themselves in time in the application.
So, this was my query to delete all but the first entry in the marty table for each barcode
DELETE FROM marty
WHERE itemnumber NOT IN
(SELECT MIN(itemnumber) FROM manny GROUP BY barcode)
There are 130,000 rows in marty and manny. The query took over 24 hours and then didn't finish properly. The connection to the server crashed and the query did not do all the updates.
Is there a better way to approach this that would not us the subquery, which i think is causing the delay? And the group by is probably slowing things down too with so many records.
Thanks
Upvotes: 3
Views: 2340
Reputation: 65
One more variant: this variant works without any temporary tables for deleting duplicates:
Delete m1
From Marty m1
join Marty m2
on m1.barcode = m2.barcode
and m1.itemnumber > m2.itemnumber
Upvotes: 2
Reputation: 17771
MySQL is notoriously slow when using IN
with very large sets. A scripted alternative:
Use a script to construct a long itemnumber = X OR itemnumber = y OR itemnumber = z
clause (chunks size ~1000) and INSERT
the matched rows (i.e. the ones that would not have been DELETE
d in your previous query) into a new table, TRUNCATE
the existing and load the contents of the new table back into the old with INSERT INTO marty SELECT * FROM marty_tmp
.
You may want to lock the table or run in a transaction for the final TRUNCATE
, INSERT
.
edit:
SELECT MIN(itemnumber) FROM manny GROUP BY barcode
from a script, store results in desiredItemNumbers array INSERT INTO manny_tmp SELECT * FROM manny WHERE itemnumber = desiredItemNumbers[0] OR itemnumber = desiredItemNumbers[1] ...
. Rerun this query until you've exhausted the desiredItemNumbers array (n.b. the last query will probably have less than 1000 desiredItemNumbers).DELETE
d the rest, so swap the contents of the marty
and marty_tmp
tables.TRUNCATE marty
INSERT INTO marty SELECT * FROM marty_tmp
Upvotes: 1
Reputation: 11963
Here is a two-stage approach that avoids use of NOT IN
. It also does not use the temporary table "manny". First, join "marty" to itself to pick out rows for which itemnumber != min(itemnumber). Use UPDATE
to set barcode
for these rows to NULL
. A second pass with DELETE
then removes all rows that were flagged in the first phase.
For this example, I split the barcode
column of "marty" into two columns; it could be done with the table in its original format with some modification (need to split the column values on the fly).
select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
| 17912 | 2 | 14 |
| 18082 | 2 | 1 |
| 21870 | 2 | 10 |
| 29219 | 2 | 8 |
| 30133 | 3 | 5 |
| 30134 | 3 | 7 |
| 30139 | 3 | 9 |
| 30142 | 3 | 12 |
+------------+---------+---------+
8 rows in set (0.00 sec)
UPDATE
(marty m1
JOIN
(SELECT barcode,
MIN(itemnumber) AS itemnumber
FROM marty
GROUP BY barcode) m2
USING(barcode))
SET m1.barcode = NULL WHERE m1.itemnumber != m2.itemnumber;
mysql> select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
| 17912 | 2 | 14 |
| 18082 | NULL | 1 |
| 21870 | NULL | 10 |
| 29219 | NULL | 8 |
| 30133 | 3 | 5 |
| 30134 | NULL | 7 |
| 30139 | NULL | 9 |
| 30142 | NULL | 12 |
+------------+---------+---------+
8 rows in set (0.00 sec)
DELETE FROM marty WHERE barcode IS NULL;
Upvotes: 1
Reputation: 9150
If you are creating temp tables anyway, how about building your table with an "INSERT INTO " or "CREATE TABLE .. AS ..." based on:
SELECT MIN(itemnumber) AS itemnumber, barcode
FROM marty
GROUP BY barcode
Upvotes: 0