Reputation: 41
Disclaimer - I've read many fantastic questions & their answers already and have given it a try too. The only issue is that considering the database size, the system gets stuck at "loading" and it just sits there. By looking at the total number of rows, I've seen changes happening but its just not significant when it doesn't give a warning or do it in pieces. I do have a fair idea of tweaking available code and make it work but I'm not a full time/advance developer (yet!)
Question - I've been working on a database that comprises of product information but has duplicate values (was foolish to not make "Product code" column as unique while importing multiple CVS into the database). I need help with removing the duplicate basis the "product code" but want to "keep one" which has maximum information under "specification" column.
Database - MySQL Total records - 36 Million + Total columns - Not more than 15 (but are of less relevance) Problem - Multiple duplicate values basis "product code" but keep one that has maximum number of characters in the "specifications column"
Database details; Table name - pro
Columns names are as; productid - VARCHAR, manPartId - VARCHAR, specification - TEXT
So far, I've picked the following code from and have given it a try, but the system gets stuck at "loading" and nothing happens. I assume this is because of the huge number of records it has.
Code that I had tried running in the phpMyAdmin "SQL" section is;
--------------------------------------------
delete pro
from pro
inner join (
select max(productid) as lastId, manPartId
from pro
group by manPartId
having count(*) > 1) duplic on duplic.manPartId = pro.manPartId
where pro.productid < duplic.lastId;
--------------------------------------------
The above code has been tweaked from the original available at MySQL delete duplicate records but keep latest
Please do help and understand on where am I going wrong. Please also note that I do understand that above code only works around "deleting all but keeping one" and is not "keeping one basis total text available in Specification column".
Many thanks in advance!
EDIT - As per recommendations from aendeerei, I've made some edits to the details.
-------------------------------------------------------
productid | manPartId | specification
-------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
2 ABC2 2MP camera, Steel body
3 ABC3 5MP, 6500 MaH, Red
4 ABC1 2500 MaH, Steel body
5 ABC2 5MP camera, plastic body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH
8 ABC2 5MP camera
9 ABC3 15MP, 6500 MaH, Blue body
10 ABC5 2500 MaH, Steel body
-------------------------------------------
In the above case, I'm looking at removing the duplicates basis manPartId but want to keep one record that has the maximum (characters) in the specification field.
After running the query, I'd like to see the following updated data having unique manPartId with max text under specification column;
-------------------------------------------------------
productid | manPartId | specification
---------------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
5 ABC2 5MP camera, plastic body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH, Long life
9 ABC3 15MP, 6500 MaH, Blue body
---------------------------------------------------------------
Please accept my apologies if its still unclear!
Upvotes: 2
Views: 233
Reputation:
Well, I can't say much here. Just follow the steps (three + an intermediary one) and read my comments carefully. I have chosen a convenient way for you: to run a simple query per step. It can be done also in other way, e.g. using a stored procedure, or many. But it wouldn't be better for you, since your task is a one-occasion process and a very sensible one. It is better to have the control over all operation results.
You asked me in the comments, what should you use as interface for your task. Well, MySQL Workbench is a good one for such operations, but it breaks/freezes a lot. phpmyadmin? Hm... I use for now SequelPRO and I must say, that I really like it. Can it manage your task? I don't know. But for sure I know one which can: the best MySQL software I ever used - and I'll certainly buy it for personal use too - was SQLyog. An extremely powerful, stable and robust application. Especially when you're dealing with duplicates/exports of databases: it never disappoints.
I saw that you have VARCHAR
as data type for the column productid
. Make it like this:
`productid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
And, if you want to never have duplicates in manPartid
column, then create a UNIQUE
index on it.
I also recommend you to keep a uniform naming convention. Like:
productId
, or product_id
, instead of productid
manPartId
, or man_part_id
, instead of manPartid
And give the name products
to the products table.
Now, I structured my answer into two parts: "steps to follow" and "results". For each step I posted the corresponding step results.
Before you begin doing something:
I wish you good luck!
=================================================================
STEP 1:
=================================================================
Create a new table proTmp with the following columns:
- manPartid: definition identical with pro.manPartid
- maxLenSpec: maximum specification length of each pro.manPartid.
=================================================================
CREATE TABLE `proTmp` (
`manPartId` varchar(255) DEFAULT NULL,
`maxLenSpec` bigint(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
===============================================================
STEP 2:
===============================================================
- Truncate table proTmp;
- Get a dataset with all [pro.manPartid, pro.maxLenSpec] pairs;
- Store the dataset into table proTmp.
===============================================================
TRUNCATE proTmp;
INSERT INTO proTmp (
SELECT
pro.manPartid
, MAX(LENGTH(pro.specification)) AS maxLenSpec
FROM pro
GROUP BY pro.manPartid
);
=============================================================
INTERMEDIARY STEP - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE DELETED IN STEP 3:
=============================================================
Left join tables pro and proTmp and display only the
records with pro.lenSpec = proTmp.maxLenSpec.
- lenSpec: length of pro.specification
=============================================================
a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________
SELECT
a.*
, LENGTH(a.specification) as lenSpec
, b.*
FROM pro AS a
LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
WHERE LENGTH(a.specification) = b.maxLenSpec
ORDER BY a.manPartid;
b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________
SELECT a.productid
FROM pro AS a
LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
WHERE LENGTH(a.specification) = b.maxLenSpec
ORDER BY a.productid;
====================================================================
STEP 3:
====================================================================
Delete all records from pro having pro.lenSpec != proTmp.maxLenSpec.
IMPORTANT: ordered by pro.productid !!!
====================================================================
DELETE FROM pro
WHERE
pro.productid NOT IN (
SELECT a.productid
FROM (SELECT * FROM pro AS tmp) AS a
LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
WHERE LENGTH(a.specification) = b.maxLenSpec
ORDER BY a.productid
);
------------------------------------------------------------------------------------------------------------------
NOTA BENE:
------------------------------------------------------------------------------------------------------------------
NOTICE THAT I ADDED A NEW RECORD INTO TABLE pro, WITH THE productid = 11 & manPartid = "ABC1". ITS specification
COLUMN HAS THE SAME MAXIMUM LENGTH AS THE RECORD WITH THE productid = 1 & manPartid = "ABC1" !!! IN THE END,
AFTER STEP 3, E:G: AFTER DELETION OF DUPLICATES, BOTH RECORDS SHOULD STILL EXIST IN TABLE pro, BECAUSE THEY BOTH
HAVE THE MAXIMUM LENGTH of specification COLUMN. THEREFORE, THERE WILL STILL EXIST SUCH DUPLICATES IN THE TABLE
pro AFTER DELETION. IN ORDER TO DECIDE WHICH ONLY ONE OF THESE DUPLICATES SHOULD REMAIN IN THE TABLE, YOU MUST
THINK ABOUT SOME OTHER CONDITIONS AS THE ONES WE KNOW FROM YOU IN THIS MOMENT. BUT, FIRST THINGS FIRST...
SEE ALSO THE RESULTS AFTER RUNNING STEP 3.
------------------------------------------------------------------------------------------------------------------
=================================================================
CREATION SYNTAX AND CONTENT OF TABLE pro, USED BY ME:
=================================================================
CREATE TABLE `pro` (
`productid` varchar(255) DEFAULT NULL,
`manPartId` varchar(255) DEFAULT NULL,
`specification` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--------------------------------------------------------
productid manPartId specification
--------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
10 ABC5 2500 MaH, Steel body
2 ABC2 2MP camera, Steel body
3 ABC3 5MP, 6500 MaH, Red
4 ABC1 2500 MaH, Steel body
5 ABC2 5MP camera, plastic body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH
8 ABC2 5MP camera
9 ABC3 15MP, 6500 MaH, Blue body
11 ABC1 12345678901234567890123456789012
===============================================================
STEP 1 - RESULTS: Creation of table proTmp
===============================================================
Just the table proTmp was created, without any content.
===============================================================
STEP 2 - RESULTS: Table proTmp content
===============================================================
----------------------
manPartId maxLenSpec
----------------------
ABC1 32
ABC2 24
ABC3 25
ABC4 32
ABC5 21
============================================================
INTERMEDIARY STEP RESULTS - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE DELETED IN STEP 3
============================================================
a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________
----------------------------------------------------------------------------------------------
productid manPartId specification lenSpec manPartId maxLenSpec
----------------------------------------------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body 32 ABC1 32
11 ABC1 12345678901234567890123456789012 32 ABC1 32
5 ABC2 5MP camera, plastic body 24 ABC2 24
9 ABC3 15MP, 6500 MaH, Blue body 25 ABC3 25
6 ABC4 5MP camera, 2500 MaH, Steel body 32 ABC4 32
7 ABC5 15MP camera, 4500 MaH 21 ABC5 21
b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________
---------
productid
---------
1
11
5
6
7
9
===========================================================================================
STEP 3 - RESULTS: Table pro after deletion of all duplicates by the two conditions
===========================================================================================
From the log after running the DELETE query:
"No errors, 5 rows affected, taking 6.5 ms"
NOTA BENE: NOTICE THAT THERE ARE STILL TWO RECORDS WITH THE manPartid = "ABC1",
BECAUSE THEY BOTH HAD THE SAME MAXIMUM LENGTH OF THE specification COLUMN !!!
--------------------------------------------------------
productid manPartId specification
--------------------------------------------------------
1 ABC1 5MP camera, 2500 MaH, Steel body
11 ABC1 12345678901234567890123456789012
5 ABC2 5MP camera, plastic body
9 ABC3 15MP, 6500 MaH, Blue body
6 ABC4 5MP camera, 2500 MaH, Steel body
7 ABC5 15MP camera, 4500 MaH
I hope it all works.
STEP 1: THIS STEP MUST BE APPLIED. OTHERWISE YOU ARE RECEIVING A FALSE LIST OF THE RECORDS TO DELETE!:
Convert column productid
from VARCHAR
to:
`productid` bigint(20) unsigned NOT NULL
STEP 2: Run the following DELETE
query.
DELETE FROM pro
WHERE pro.productid NOT IN (
SELECT max(b.productid) AS maxPartid
FROM (SELECT * FROM pro AS a) AS b
GROUP BY b.manPartid
);
Upvotes: 1
Reputation: 48139
First, basis, find all parts longest length (Query #1)
SELECT
manPartID,
MAX( CHAR_LENGTH( specification )) longestLength
from
pro
group by
manPartID
WITH that as the baseline, now look for all parts that have that same longest length. But in the case where there are more than one with the exact same length, you need to pick one, such as the first ProductID or the most recent ProductID to be kept... (Query #2)
SELECT
p.manPartID,
MAX( p.productid ) as ProductID
from
pro p
JOIN
( Entire Query #1 above ) byLen
ON p.manPartID = byLen.manPartID
AND char_length( p.specification ) = byLen.LongestLength
group by
p.manPartID
So at this point, you have only one "ProductID" for a single "manPartID" based on the longest specification... Now, you can delete from the main table where it is NOT one of the above such as below. I am doing a LEFT JOIN to the #2 query because I want all records compared and delete only those that are NOT FOUND in the keep result set.
DELETE FROM Pro
LEFT JOIN (entire query #2 above) Keep
ON Pro.ProductID = Keep.ProductID
where Keep.ProductID IS NULL
Now, on a table of 36 million records, you PROBABLY want to ensure the above works before blowing away your data. So instead of deleting, I would create a new secondary table of products and insert into that just to confirm you are getting what you are hoping for...
INSERT INTO SomeTempTable
SELECT p1.*
from Pro p1
JOIN ( query #2 above ) Keep
ON p1.ProductID = Keep.ProductID
Notice this one is a JOIN (not left-join as used in the delete) as I WANT only those products I expect to keep
I am sure there are other elements on the table as described, so to help the query performance, I would have the following index on your "Pro"duct table.
(manPartID, specification, productID)
This way the work can be done off the indexes and not have to go through all the data pages for every record.
Upvotes: 1