Reputation: 181
I'm having trouble finding duplicates in a database table with the following setup:
==========================================================================
| stock_id | product_id | store_id | stock_qty | updated_at |
==========================================================================
| 9990 | 51 | 1 | 13 | 2014-10-25 16:30:01 |
| 9991 | 90 | 2 | 5 | 2014-10-25 16:30:01 |
| 9992 | 161 | 1 | 3 | 2014-10-25 16:30:01 |
| 9993 | 254 | 1 | 18 | 2014-10-25 16:30:01 |
| 9994 | 284 | 2 | 12 | 2014-10-25 16:30:01 |
| 9995 | 51 | 1 | 11 | 2014-10-25 17:30:02 |
| 9996 | 90 | 2 | 5 | 2014-10-25 17:30:02 |
| 9997 | 161 | 1 | 3 | 2014-10-25 17:30:02 |
| 9998 | 254 | 1 | 16 | 2014-10-25 17:30:02 |
| 9999 | 284 | 2 | 12 | 2014-10-25 17:30:02 |
==========================================================================
Stock updates are imported into this table every hour, I'm trying to find duplicate stock entries (any rows which have a matching product id and store id) so I can delete the oldest. The query below is my attempt, by comparing product ids and store ids on a join like this I can find one set of duplicates:
SELECT s.`stock_id`, s.`product_id`, s.`store_id`, s.`stock_qty`, s.`updated_at`
FROM `stock` s
INNER JOIN `stock` j ON s.`product_id`=j.`product_id` AND s.`store_id`=j.`store_id`
GROUP BY `stock_id`
HAVING COUNT(*) > 1
ORDER BY s.updated_at DESC, s.product_id ASC, s.store_id ASC, s.stock_id ASC;
While this query will work, it doesn't find ALL duplicates, only 1 set, which means if an import goes awry and isn't noticed until the morning, there's a possibility that we'll be left with tons of duplicate stock entries. My MySQL skills are sadly lacking and I'm at a complete loss about how to find and delete all duplicates in a fast, reliable manner.
Any help or ideas are welcome. Thanks
Upvotes: 2
Views: 368
Reputation: 1472
You can use this query:
DELETE st FROM stock st, stock st2
WHERE st.stock_id < st2.stock_id AND st.product_id = st2.product_id AND
st.store_id = st2.store_id;
This query will delete older record having same product_id
and store_id
and will keep latest record.
Upvotes: 2
Reputation: 286
Or you can use a stored Procedure:
DELIMITER //
DROP PROCEDURE IF EXISTS removeDuplicates;
CREATE PROCEDURE removeDuplicates(
stockID INT
)
BEGIN
DECLARE stockToKeep INT;
DECLARE storeID INT;
DECLARE productID INT;
-- gets the store and product value
SELECT DISTINCT store_id, product_id
FROM stock
WHERE stock_id = stockID
LIMIT 1
INTO
storeID, productID;
SELECT stock_id
FROM stock
WHERE product_id = productID AND store_id = storeID
ORDER BY updated_at DESC
LIMIT 1
INTO
stockToKeep;
DELETE FROM stock
WHERE product_id = productID AND store_id = storeID
AND stock_id != stockToKeep;
END //
DELIMITER ;
And afterwards call it for every pair of the product id and store id via a cursor procedure: DELIMITER // CREATE PROCEDURE updateTable() BEGIN DECLARE done BOOLEAN DEFAULT FALSE; DECLARE stockID INT UNSIGNED; DECLARE cur CURSOR FOR SELECT DISTINCT stock_id FROM stock; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;
OPEN cur;
testLoop: LOOP
FETCH cur INTO stockID;
IF done THEN
LEAVE testLoop;
END IF;
CALL removeDuplicates(stockID);
END LOOP testLoop;
CLOSE cur;
END//
DELIMITER ;
And then just call the second procedure
CALL updateTable();
Upvotes: 1
Reputation: 2765
A self join on store_id
, product_id
and 'is older' in combination with DISTINCT
should give you all rows where also a newer version exists:
> SHOW CREATE TABLE stock;
CREATE TABLE `stock` (
`stock_id` int(11) NOT NULL,
`product_id` int(11) DEFAULT NULL,
`store_id` int(11) DEFAULT NULL,
`stock_qty` int(11) DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`stock_id`)
> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at |
+----------+------------+----------+-----------+---------------------+
| 1 | 1 | 1 | 1 | 2001-01-01 12:00:00 |
| 2 | 2 | 2 | 1 | 2001-01-01 12:00:00 |
| 3 | 2 | 2 | 1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+
> SELECT DISTINCT s1.stock_id, s1.store_id, s1.product_id, s1.updated_at
FROM stock s1 JOIN stock s2
ON s1.store_id = s2.store_id
AND s1.product_id = s2.product_id
AND s1.updated_at < s2.updated_at;
+----------+----------+------------+---------------------+
| stock_id | store_id | product_id | updated_at |
+----------+----------+------------+---------------------+
| 2 | 2 | 2 | 2001-01-01 12:00:00 |
+----------+----------+------------+---------------------+
> DELETE stock FROM stock
JOIN stock s2 ON stock.store_id = s2.store_id
AND stock.product_id = s2.product_id
AND stock.updated_at < s2.updated_at;
Query OK, 1 row affected (0.02 sec)
> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at |
+----------+------------+----------+-----------+---------------------+
| 1 | 1 | 1 | 1 | 2001-01-01 12:00:00 |
| 3 | 2 | 2 | 1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+
Upvotes: 1