Martin McGovern
Martin McGovern

Reputation: 31

Mysql - deleting duplicates

I have a table with a barcode column with a unique index. The data has been loaded with additional chars (-xx) at the end of each barcode to prevent duplicates, but there will be lots of duplicates once I strip off the suffix. Here is a sample of the data:

itemnumber  barcode

17912       2-14
18082       2-1
21870       2-10
29219       2-8

Then I created two temporary tables, marty and manny, both with the itemnumber and the stripped down barcodes. So,both tables would contain

itemnumber  barcode

17912       2
18082       2
21870       2
29219       2

etc

And the I tried to delete all but the first entry with barcode '2' in the marty table(and every other barcode). I hoped then to update the original table with the correct first entry and the users could fix up the duplicates themselves in time in the application.

So, this was my query to delete all but the first entry in the marty table for each barcode

DELETE FROM marty
  WHERE itemnumber NOT IN
    (SELECT MIN(itemnumber) FROM manny GROUP BY barcode)

There are 130,000 rows in marty and manny. The query took over 24 hours and then didn't finish properly. The connection to the server crashed and the query did not do all the updates.

Is there a better way to approach this that would not us the subquery, which i think is causing the delay? And the group by is probably slowing things down too with so many records.

Thanks

Upvotes: 3

Views: 2340

Answers (4)

DAlex
DAlex

Reputation: 65

One more variant: this variant works without any temporary tables for deleting duplicates:

 Delete m1
 From Marty m1
 join Marty m2 
    on m1.barcode = m2.barcode 
    and m1.itemnumber > m2.itemnumber

Upvotes: 2

Andy
Andy

Reputation: 17771

MySQL is notoriously slow when using IN with very large sets. A scripted alternative:

Use a script to construct a long itemnumber = X OR itemnumber = y OR itemnumber = z clause (chunks size ~1000) and INSERT the matched rows (i.e. the ones that would not have been DELETEd in your previous query) into a new table, TRUNCATE the existing and load the contents of the new table back into the old with INSERT INTO marty SELECT * FROM marty_tmp.

You may want to lock the table or run in a transaction for the final TRUNCATE, INSERT.

edit:

  • Query SELECT MIN(itemnumber) FROM manny GROUP BY barcode from a script, store results in desiredItemNumbers array
  • Take batches of 1000 desiredItemNumbers and construct this query: INSERT INTO manny_tmp SELECT * FROM manny WHERE itemnumber = desiredItemNumbers[0] OR itemnumber = desiredItemNumbers[1] .... Rerun this query until you've exhausted the desiredItemNumbers array (n.b. the last query will probably have less than 1000 desiredItemNumbers).
  • You now have a table with the results that you would have been left with had you DELETEd the rest, so swap the contents of the marty and marty_tmp tables.
  • TRUNCATE marty
  • INSERT INTO marty SELECT * FROM marty_tmp

Upvotes: 1

gcbenison
gcbenison

Reputation: 11963

Here is a two-stage approach that avoids use of NOT IN. It also does not use the temporary table "manny". First, join "marty" to itself to pick out rows for which itemnumber != min(itemnumber). Use UPDATE to set barcode for these rows to NULL. A second pass with DELETE then removes all rows that were flagged in the first phase.

For this example, I split the barcode column of "marty" into two columns; it could be done with the table in its original format with some modification (need to split the column values on the fly).

select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
|      17912 |       2 |      14 |
|      18082 |       2 |       1 |
|      21870 |       2 |      10 |
|      29219 |       2 |       8 |
|      30133 |       3 |       5 |
|      30134 |       3 |       7 |
|      30139 |       3 |       9 |
|      30142 |       3 |      12 |
+------------+---------+---------+
8 rows in set (0.00 sec)

UPDATE
  (marty m1
   JOIN
     (SELECT barcode,
             MIN(itemnumber) AS itemnumber
      FROM marty
      GROUP BY barcode) m2
   USING(barcode))
SET m1.barcode = NULL WHERE m1.itemnumber != m2.itemnumber;

mysql> select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
|      17912 |       2 |      14 |
|      18082 |    NULL |       1 |
|      21870 |    NULL |      10 |
|      29219 |    NULL |       8 |
|      30133 |       3 |       5 |
|      30134 |    NULL |       7 |
|      30139 |    NULL |       9 |
|      30142 |    NULL |      12 |
+------------+---------+---------+
8 rows in set (0.00 sec)

DELETE FROM marty WHERE barcode IS NULL;

Upvotes: 1

Glenn
Glenn

Reputation: 9150

If you are creating temp tables anyway, how about building your table with an "INSERT INTO " or "CREATE TABLE .. AS ..." based on:

SELECT MIN(itemnumber) AS itemnumber, barcode
  FROM marty
  GROUP BY barcode

Upvotes: 0

Related Questions