Amit Agnihotri
Amit Agnihotri

Reputation: 41

38 Million records - Deleting duplicate rows basis column name and keeping just one

Disclaimer - I've read many fantastic questions & their answers already and have given it a try too. The only issue is that considering the database size, the system gets stuck at "loading" and it just sits there. By looking at the total number of rows, I've seen changes happening but its just not significant when it doesn't give a warning or do it in pieces. I do have a fair idea of tweaking available code and make it work but I'm not a full time/advance developer (yet!)

Question - I've been working on a database that comprises of product information but has duplicate values (was foolish to not make "Product code" column as unique while importing multiple CVS into the database). I need help with removing the duplicate basis the "product code" but want to "keep one" which has maximum information under "specification" column.

Database - MySQL Total records - 36 Million + Total columns - Not more than 15 (but are of less relevance) Problem - Multiple duplicate values basis "product code" but keep one that has maximum number of characters in the "specifications column"

Database details; Table name - pro

Columns names are as; productid - VARCHAR, manPartId - VARCHAR, specification - TEXT

So far, I've picked the following code from and have given it a try, but the system gets stuck at "loading" and nothing happens. I assume this is because of the huge number of records it has.

Code that I had tried running in the phpMyAdmin "SQL" section is;

--------------------------------------------
      delete pro
      from pro
      inner join (
         select max(productid) as lastId, manPartId
           from pro
          group by manPartId
         having count(*) > 1) duplic on duplic.manPartId = pro.manPartId
         where pro.productid < duplic.lastId;
--------------------------------------------

The above code has been tweaked from the original available at MySQL delete duplicate records but keep latest

Please do help and understand on where am I going wrong. Please also note that I do understand that above code only works around "deleting all but keeping one" and is not "keeping one basis total text available in Specification column".

Many thanks in advance!

EDIT - As per recommendations from aendeerei, I've made some edits to the details.

-------------------------------------------------------
productid  | manPartId    |  specification 
-------------------------------------------------------
1            ABC1           5MP camera, 2500 MaH, Steel body
2            ABC2           2MP camera, Steel body
3            ABC3           5MP, 6500 MaH, Red
4            ABC1           2500 MaH, Steel body
5            ABC2           5MP camera, plastic body
6            ABC4           5MP camera, 2500 MaH, Steel body
7            ABC5           15MP camera, 4500 MaH 
8            ABC2           5MP camera
9            ABC3           15MP, 6500 MaH, Blue body
10           ABC5           2500 MaH, Steel body
-------------------------------------------

In the above case, I'm looking at removing the duplicates basis manPartId but want to keep one record that has the maximum (characters) in the specification field.

After running the query, I'd like to see the following updated data having unique manPartId with max text under specification column;

-------------------------------------------------------
productid  | manPartId    |  specification 
---------------------------------------------------------------
1            ABC1           5MP camera, 2500 MaH, Steel body
5            ABC2           5MP camera, plastic body
6            ABC4           5MP camera, 2500 MaH, Steel body
7            ABC5           15MP camera, 4500 MaH, Long life
9            ABC3           15MP, 6500 MaH, Blue body
---------------------------------------------------------------

Please accept my apologies if its still unclear!

Upvotes: 2

Views: 233

Answers (2)

user7941334
user7941334

Reputation:

Well, I can't say much here. Just follow the steps (three + an intermediary one) and read my comments carefully. I have chosen a convenient way for you: to run a simple query per step. It can be done also in other way, e.g. using a stored procedure, or many. But it wouldn't be better for you, since your task is a one-occasion process and a very sensible one. It is better to have the control over all operation results.

You asked me in the comments, what should you use as interface for your task. Well, MySQL Workbench is a good one for such operations, but it breaks/freezes a lot. phpmyadmin? Hm... I use for now SequelPRO and I must say, that I really like it. Can it manage your task? I don't know. But for sure I know one which can: the best MySQL software I ever used - and I'll certainly buy it for personal use too - was SQLyog. An extremely powerful, stable and robust application. Especially when you're dealing with duplicates/exports of databases: it never disappoints.

I saw that you have VARCHAR as data type for the column productid. Make it like this:

`productid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)

And, if you want to never have duplicates in manPartid column, then create a UNIQUE index on it.

I also recommend you to keep a uniform naming convention. Like:

  • productId, or product_id, instead of productid
  • manPartId, or man_part_id, instead of manPartid

And give the name products to the products table.

Now, I structured my answer into two parts: "steps to follow" and "results". For each step I posted the corresponding step results.

Before you begin doing something:

MAKE A BACK-UP OF YOUR DATA!

I wish you good luck!

STEPS TO FOLLOW:

=================================================================
STEP 1:
=================================================================
Create a new table proTmp with the following columns:
- manPartid: definition identical with pro.manPartid
- maxLenSpec: maximum specification length of each pro.manPartid.
=================================================================

    CREATE TABLE `proTmp` (
      `manPartId` varchar(255) DEFAULT NULL,
      `maxLenSpec` bigint(20) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;



===============================================================
STEP 2:
===============================================================
- Truncate table proTmp;
- Get a dataset with all [pro.manPartid, pro.maxLenSpec] pairs;
- Store the dataset into table proTmp.
===============================================================

    TRUNCATE proTmp;
    INSERT INTO proTmp (
        SELECT 
            pro.manPartid
            , MAX(LENGTH(pro.specification)) AS maxLenSpec
        FROM pro
        GROUP BY pro.manPartid
    );



=============================================================
INTERMEDIARY STEP - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE DELETED IN STEP 3:
=============================================================
Left join tables pro and proTmp and display only the 
records with pro.lenSpec = proTmp.maxLenSpec.
- lenSpec: length of pro.specification
=============================================================


a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________

    SELECT 
        a.*
        , LENGTH(a.specification) as lenSpec
        , b.*
    FROM pro AS a
    LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
    WHERE LENGTH(a.specification) = b.maxLenSpec
    ORDER BY a.manPartid;


b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________

    SELECT a.productid
    FROM pro AS a
    LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
    WHERE LENGTH(a.specification) = b.maxLenSpec
    ORDER BY a.productid;



====================================================================
STEP 3:
====================================================================
Delete all records from pro having pro.lenSpec != proTmp.maxLenSpec.
IMPORTANT: ordered by pro.productid !!!
====================================================================

    DELETE FROM pro
    WHERE 
        pro.productid NOT IN (
            SELECT a.productid
            FROM (SELECT * FROM pro AS tmp) AS a
            LEFT JOIN proTmp AS b ON b.manPartid = a.manPartid
            WHERE LENGTH(a.specification) = b.maxLenSpec
            ORDER BY a.productid
        );

THE RESULTS OF THE EXECUTED OPERATIONS:

------------------------------------------------------------------------------------------------------------------
NOTA BENE:
------------------------------------------------------------------------------------------------------------------
NOTICE THAT I ADDED A NEW RECORD INTO TABLE pro, WITH THE productid = 11 & manPartid = "ABC1". ITS specification 
COLUMN HAS THE SAME MAXIMUM LENGTH AS THE RECORD WITH THE productid = 1 & manPartid = "ABC1" !!! IN THE END, 
AFTER STEP 3, E:G: AFTER DELETION OF DUPLICATES, BOTH RECORDS SHOULD STILL EXIST IN TABLE pro, BECAUSE THEY BOTH
HAVE THE MAXIMUM LENGTH of specification COLUMN. THEREFORE, THERE WILL STILL EXIST SUCH DUPLICATES IN THE TABLE 
pro AFTER DELETION. IN ORDER TO DECIDE WHICH ONLY ONE OF THESE DUPLICATES SHOULD REMAIN IN THE TABLE, YOU MUST
THINK ABOUT SOME OTHER CONDITIONS AS THE ONES WE KNOW FROM YOU IN THIS MOMENT. BUT, FIRST THINGS FIRST...
SEE ALSO THE RESULTS AFTER RUNNING STEP 3.
------------------------------------------------------------------------------------------------------------------


=================================================================
CREATION SYNTAX AND CONTENT OF TABLE pro, USED BY ME:
=================================================================

CREATE TABLE `pro` (
  `productid` varchar(255) DEFAULT NULL,
  `manPartId` varchar(255) DEFAULT NULL,
  `specification` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


    --------------------------------------------------------
    productid   manPartId   specification
    --------------------------------------------------------
    1           ABC1        5MP camera, 2500 MaH, Steel body
    10          ABC5        2500 MaH, Steel body
    2           ABC2        2MP camera, Steel body
    3           ABC3        5MP, 6500 MaH, Red
    4           ABC1        2500 MaH, Steel body
    5           ABC2        5MP camera, plastic body
    6           ABC4        5MP camera, 2500 MaH, Steel body
    7           ABC5        15MP camera, 4500 MaH
    8           ABC2        5MP camera
    9           ABC3        15MP, 6500 MaH, Blue body
    11          ABC1        12345678901234567890123456789012


===============================================================
STEP 1 - RESULTS: Creation of table proTmp
===============================================================

Just the table proTmp was created, without any content.


===============================================================
STEP 2 - RESULTS: Table proTmp content
===============================================================

    ----------------------
    manPartId   maxLenSpec
    ----------------------
    ABC1        32
    ABC2        24
    ABC3        25
    ABC4        32
    ABC5        21


============================================================
INTERMEDIARY STEP RESULTS - JUST FOR TEST.
IT ONLY DISPLAYS THE RECORDS WHICH WILL BE DELETED IN STEP 3
============================================================


a) Get pro.*, pro.lenSpec and proTmp.* columns, ordered by pro.manPartid.
_________________________________________________________________________

    ----------------------------------------------------------------------------------------------
    productid   manPartId   specification                       lenSpec     manPartId   maxLenSpec
    ----------------------------------------------------------------------------------------------
    1           ABC1        5MP camera, 2500 MaH, Steel body    32          ABC1        32
    11          ABC1        12345678901234567890123456789012    32          ABC1        32
    5           ABC2        5MP camera, plastic body            24          ABC2        24
    9           ABC3        15MP, 6500 MaH, Blue body           25          ABC3        25
    6           ABC4        5MP camera, 2500 MaH, Steel body    32          ABC4        32
    7           ABC5        15MP camera, 4500 MaH               21          ABC5        21


b) Get only pro.productid column, ordered by pro.productid.
___________________________________________________________

    ---------
    productid
    ---------
    1
    11
    5
    6
    7
    9


===========================================================================================
STEP 3 - RESULTS: Table pro after deletion of all duplicates by the two conditions
===========================================================================================

From the log after running the DELETE query:
"No errors, 5 rows affected, taking 6.5 ms"

NOTA BENE: NOTICE THAT THERE ARE STILL TWO RECORDS WITH THE manPartid = "ABC1",
BECAUSE THEY BOTH HAD THE SAME MAXIMUM LENGTH OF THE specification COLUMN !!!

    --------------------------------------------------------
    productid   manPartId   specification
    --------------------------------------------------------
    1           ABC1        5MP camera, 2500 MaH, Steel body
    11          ABC1        12345678901234567890123456789012
    5           ABC2        5MP camera, plastic body
    9           ABC3        15MP, 6500 MaH, Blue body
    6           ABC4        5MP camera, 2500 MaH, Steel body
    7           ABC5        15MP camera, 4500 MaH

I hope it all works.

EDIT 1:

DELETE ALL RECORDS EXCEPT THE ONES WITH MAXIMAL `productid`:

STEP 1: THIS STEP MUST BE APPLIED. OTHERWISE YOU ARE RECEIVING A FALSE LIST OF THE RECORDS TO DELETE!:

Convert column productid from VARCHAR to:

`productid` bigint(20) unsigned NOT NULL

STEP 2: Run the following DELETE query.

DELETE FROM pro
WHERE pro.productid NOT IN (
    SELECT max(b.productid) AS maxPartid
    FROM (SELECT * FROM pro AS a) AS b
    GROUP BY b.manPartid
);

Upvotes: 1

DRapp
DRapp

Reputation: 48139

First, basis, find all parts longest length (Query #1)

SELECT 
      manPartID,
      MAX( CHAR_LENGTH( specification )) longestLength
   from
      pro
   group by
      manPartID

WITH that as the baseline, now look for all parts that have that same longest length. But in the case where there are more than one with the exact same length, you need to pick one, such as the first ProductID or the most recent ProductID to be kept... (Query #2)

SELECT
      p.manPartID,
      MAX( p.productid ) as ProductID
   from
      pro p
         JOIN
            ( Entire Query #1 above ) byLen
          ON p.manPartID = byLen.manPartID
          AND char_length( p.specification ) = byLen.LongestLength
   group by
      p.manPartID

So at this point, you have only one "ProductID" for a single "manPartID" based on the longest specification... Now, you can delete from the main table where it is NOT one of the above such as below. I am doing a LEFT JOIN to the #2 query because I want all records compared and delete only those that are NOT FOUND in the keep result set.

DELETE FROM Pro
   LEFT JOIN (entire query #2 above) Keep
      ON Pro.ProductID = Keep.ProductID
   where Keep.ProductID IS NULL

Now, on a table of 36 million records, you PROBABLY want to ensure the above works before blowing away your data. So instead of deleting, I would create a new secondary table of products and insert into that just to confirm you are getting what you are hoping for...

INSERT INTO SomeTempTable
SELECT p1.*
   from Pro p1
      JOIN ( query #2 above ) Keep
         ON p1.ProductID = Keep.ProductID

Notice this one is a JOIN (not left-join as used in the delete) as I WANT only those products I expect to keep

I am sure there are other elements on the table as described, so to help the query performance, I would have the following index on your "Pro"duct table.

(manPartID, specification, productID)

This way the work can be done off the indexes and not have to go through all the data pages for every record.

Upvotes: 1

Related Questions