Delete duplicate values from production big size table with optimized way

Question

Instance details: Aurora 3 with Mysql 8 Table Name: abc with 2 TB size Problem: Table has duplicate values also wanted to add unique key and partitions.

Current Table definition:

CREATE TABLE `abc` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `abcId` int unsigned DEFAULT NULL,
  `createdAt` datetime NOT NULL,
  `qua` decimal(12,7) NOT NULL,
  `timestamp` datetime NOT NULL,
  `otherTimestamp` datetime DEFAULT NULL,
  PRIMARY KEY (`id`,`timestamp`),
  KEY `abcId_idx` (`abcId`),
  KEY `abcId_timestamp_idx` (`abcId`,`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=15590380792 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
/*!50100 PARTITION BY RANGE (year(`timestamp`))
(PARTITION p2017 VALUES LESS THAN (2018) ENGINE = InnoDB,
 PARTITION p2018 VALUES LESS THAN (2019) ENGINE = InnoDB,
 PARTITION p2019 VALUES LESS THAN (2020) ENGINE = InnoDB,
 PARTITION p2020 VALUES LESS THAN (2021) ENGINE = InnoDB,
 PARTITION p202x VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

Some Workarounds:

I will remove KEY abcId_idx (abcId) as we are going to add UniQue index as below UNIQUE KEY abcId_timestamp_idx (abcId,timestamp)
To Delete and change the table defination trying below steps
create tempTable with above unique key and new partitions till 2030
create a sp to insert distinct records in tmpTable with below query

-- execute in loop
INSERT INTO tempTable
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY abcId,timestamp ORDER BY id) AS row_num
    FROM abc
) AS subquery
WHERE subquery.row_num = 1 and (id >= 1 and id <= 10000);



RENAME TABLE abc TO backup_table, tempTable TO abc;

DROP TABLE backup_table;

We tried above on small data on test instance it was taking 288 seconds. So we calculated time on production around 35 hours

So how we sync up with the 35 hours ongoing production data with this new table?

I am aware of the one way of lagging the replication time by 1 or 2 hours but is it doable with the 35 hours too?

Is there any other way to minimize the downtime?

Delete duplicate values from production big size table with optimized way

Answers (0)

Related Questions