Herbage Onion
Herbage Onion

Reputation: 181

Find and Delete Duplicate rows in MySQL

I'm having trouble finding duplicates in a database table with the following setup:

==========================================================================
| stock_id  | product_id  | store_id  | stock_qty  | updated_at          |
==========================================================================
| 9990      | 51          | 1         | 13         | 2014-10-25 16:30:01 |
| 9991      | 90          | 2         | 5          | 2014-10-25 16:30:01 |
| 9992      | 161         | 1         | 3          | 2014-10-25 16:30:01 |
| 9993      | 254         | 1         | 18         | 2014-10-25 16:30:01 |
| 9994      | 284         | 2         | 12         | 2014-10-25 16:30:01 |
| 9995      | 51          | 1         | 11         | 2014-10-25 17:30:02 |
| 9996      | 90          | 2         | 5          | 2014-10-25 17:30:02 |
| 9997      | 161         | 1         | 3          | 2014-10-25 17:30:02 |
| 9998      | 254         | 1         | 16         | 2014-10-25 17:30:02 |
| 9999      | 284         | 2         | 12         | 2014-10-25 17:30:02 |
==========================================================================

Stock updates are imported into this table every hour, I'm trying to find duplicate stock entries (any rows which have a matching product id and store id) so I can delete the oldest. The query below is my attempt, by comparing product ids and store ids on a join like this I can find one set of duplicates:

SELECT s.`stock_id`, s.`product_id`, s.`store_id`, s.`stock_qty`, s.`updated_at`
FROM `stock` s
INNER JOIN `stock` j ON s.`product_id`=j.`product_id` AND s.`store_id`=j.`store_id`
GROUP BY `stock_id`
HAVING COUNT(*) > 1
ORDER BY s.updated_at DESC, s.product_id ASC, s.store_id ASC, s.stock_id ASC;

While this query will work, it doesn't find ALL duplicates, only 1 set, which means if an import goes awry and isn't noticed until the morning, there's a possibility that we'll be left with tons of duplicate stock entries. My MySQL skills are sadly lacking and I'm at a complete loss about how to find and delete all duplicates in a fast, reliable manner.

Any help or ideas are welcome. Thanks

Upvotes: 2

Views: 368

Answers (3)

Kruti Patel
Kruti Patel

Reputation: 1472

You can use this query:

DELETE st FROM stock st, stock st2 
WHERE st.stock_id < st2.stock_id AND st.product_id = st2.product_id AND 
st.store_id = st2.store_id;

This query will delete older record having same product_id and store_id and will keep latest record.

Upvotes: 2

Artjoman
Artjoman

Reputation: 286

Or you can use a stored Procedure:

DELIMITER //
DROP PROCEDURE IF EXISTS removeDuplicates;


 CREATE PROCEDURE removeDuplicates(
   stockID INT
 )
 BEGIN


    DECLARE stockToKeep INT;
    DECLARE storeID INT;
    DECLARE productID INT;

 -- gets the store and product value
 SELECT DISTINCT store_id, product_id
 FROM stock
  WHERE stock_id = stockID  
  LIMIT 1
 INTO
  storeID, productID;

 SELECT stock_id
 FROM stock
  WHERE product_id = productID AND store_id = storeID  
  ORDER BY updated_at DESC
  LIMIT 1
 INTO
  stockToKeep;

    DELETE FROM stock 
    WHERE product_id = productID AND store_id = storeID 
    AND stock_id != stockToKeep;
END //
DELIMITER ;

And afterwards call it for every pair of the product id and store id via a cursor procedure: DELIMITER // CREATE PROCEDURE updateTable() BEGIN DECLARE done BOOLEAN DEFAULT FALSE; DECLARE stockID INT UNSIGNED; DECLARE cur CURSOR FOR SELECT DISTINCT stock_id FROM stock; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

  OPEN cur;

  testLoop: LOOP
    FETCH cur INTO stockID;
    IF done THEN
      LEAVE testLoop;
    END IF;
    CALL removeDuplicates(stockID);
  END LOOP testLoop;

  CLOSE cur;
END//
DELIMITER ;

And then just call the second procedure

CALL updateTable();

Upvotes: 1

Hartmut Holzgraefe
Hartmut Holzgraefe

Reputation: 2765

A self join on store_id, product_id and 'is older' in combination with DISTINCT should give you all rows where also a newer version exists:

> SHOW CREATE TABLE stock;
CREATE TABLE `stock` (
  `stock_id` int(11) NOT NULL,
  `product_id` int(11) DEFAULT NULL,
  `store_id` int(11) DEFAULT NULL,
  `stock_qty` int(11) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`stock_id`)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        2 |          2 |        2 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+


> SELECT DISTINCT s1.stock_id, s1.store_id, s1.product_id, s1.updated_at   
   FROM stock s1   JOIN stock s2     
     ON s1.store_id = s2.store_id    
    AND s1.product_id = s2.product_id    
    AND s1.updated_at < s2.updated_at;
+----------+----------+------------+---------------------+
| stock_id | store_id | product_id | updated_at          |
+----------+----------+------------+---------------------+
|        2 |        2 |          2 | 2001-01-01 12:00:00 |
+----------+----------+------------+---------------------+

> DELETE stock FROM stock 
               JOIN stock s2  ON stock.store_id = s2.store_id  
                             AND stock.product_id = s2.product_id 
                             AND stock.updated_at < s2.updated_at;
Query OK, 1 row affected (0.02 sec)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+

Upvotes: 1

Related Questions