software.developer

Reputation: 298

MySQL INSERT INTO ... SELECT query is too slow to copy 100 million rows

I have a table with 100+ million rows and want to copy its data into another table. I have one requirement: the query execution must not block other operations on these tables. I have written a stored procedure for this.

I count the number of rows in the source table and then loop: each iteration starts a transaction, copies 10,000 rows, commits, and then moves on to the next 10,000 rows by increasing the offset.

DELIMITER $$
CREATE PROCEDURE insert_data()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE iterations INT DEFAULT 0;
  DECLARE rowOffset INT DEFAULT 0;
  DECLARE limitSize INT DEFAULT 10000;
  SET iterations = (SELECT COUNT(*) FROM Table1) / limitSize;

  WHILE i <= iterations DO
    START TRANSACTION;
        INSERT IGNORE INTO Table2(id, field2, field3)
            SELECT f1, f2, f3
            FROM Table1
            ORDER BY id ASC
            LIMIT limitSize OFFSET rowOffset;
    COMMIT;
    SET i = i + 1;
    SET rowOffset = rowOffset + limitSize;
  END WHILE;
END$$
DELIMITER ;

The procedure runs without locking the tables, but after copying a few million rows it becomes too slow. Please suggest a better way to do this. Thank you!

Upvotes: 4

Views: 6867

Answers (4)

Coopz

Reputation: 51

For my setup and needs: I have to copy 300 to 500 million rows, promptly, on a massively under-spec server.

Dump to "csv"; Split result into multiple files (my case 200k rows was optimal) Import split files

-- source_table is a placeholder for the table being exported
SELECT a, b, c
  FROM source_table
  INTO OUTFILE '/path/dumpfile.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';

split -a 6 -l 200000 dumpfile.csv FileNamePrefix.  # trailing "." lets the generated suffix act as an extension

for f in ./*;
do 
mysql -uUser -pPassword dbname -e "set autocommit = 0; set unique_checks = 0; set foreign_key_checks = 0; set sql_log_bin=0; LOAD DATA CONCURRENT INFILE '/path/to/split/files/"$f"' IGNORE  INTO TABLE InputTableName FIELDS TERMINATED BY ','  OPTIONALLY ENCLOSED BY '\"'  (a, b, c);commit;"; 
echo "Done: '"$f"' at $(date)"; 
done

Upvotes: 1

software.developer

Reputation: 298

Thanks @Bill Karwin, I removed the offset as you suggested. The following procedure worked amazingly well:

DROP PROCEDURE IF EXISTS insert_data;
DELIMITER $$
CREATE PROCEDURE insert_data()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE limitSize INT DEFAULT 2000;
  DECLARE maxId INT DEFAULT 0;

  SET maxId = (SELECT MAX(id) FROM Table1);

  WHILE i <= maxId DO
    START TRANSACTION;
        INSERT IGNORE INTO Table2(id, field1, field2)
            SELECT id, field3, field4
                FROM Table1
                WHERE id > i
                ORDER BY id ASC
                LIMIT limitSize;
    COMMIT;
    SET i = i + limitSize;
  END WHILE;
END$$
DELIMITER ;
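
The procedure is then run with a plain CALL:

CALL insert_data();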

Upvotes: 2

Bill Karwin

Reputation: 562398

Any INSERT ... SELECT ... query does acquire a SHARED lock on the rows it reads from the source table in the SELECT. But by processing smaller chunks of rows, the lock doesn't last too long.

The query with LIMIT ... OFFSET gets slower and slower as you advance through the source table. At 10,000 rows per chunk, you need to run that query 10,000 times, and each one has to start over and scan through the table to reach the new OFFSET.

No matter what you do, copying 100 million rows is going to take a while. It's doing a lot of work.

I would use pt-archiver, a free tool designed for this purpose. It processes the rows in "chunks" (or subsets). It will dynamically adjust the size of the chunks so that each chunk takes 0.5 seconds.

The biggest difference between your method and pt-archiver is that pt-archiver doesn't use LIMIT ... OFFSET; it walks along the primary key index, selecting chunks of rows by value instead of by position. So every chunk is read more efficiently.
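
A rough sketch of what such an invocation could look like; the connection details, database name mydb, and chunk size are assumptions, so check the pt-archiver documentation for your environment:

# Copy all rows (--where "1=1") from Table1 to Table2 in chunks of 1000 rows,
# keep the source rows (--no-delete), and report progress every 100000 rows.
pt-archiver \
  --source h=localhost,D=mydb,t=Table1 \
  --dest   h=localhost,D=mydb,t=Table2 \
  --where  "1=1" \
  --limit  1000 \
  --no-delete \
  --progress 100000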


Re your comment:

I expect that making the batch size smaller — and increasing the number of iterations — will make the performance problem worse, not better.

The reason is that when you use LIMIT with OFFSET, every query has to start over at the start of the table, and count the rows up to the OFFSET value. This gets longer and longer as you iterate through the table.

Running 20,000 expensive queries using OFFSET will take longer than running 10,000 similar queries. The most expensive part will not be reading 5,000 or 10,000 rows, or inserting them into the destination table. The expensive part will be skipping through ~50,000,000 rows, over and over again.

Instead, you should iterate over the table by values not by offsets.

INSERT IGNORE INTO Table2(id, field2, field3)
        SELECT f1, f2, f3
        FROM Table1
        WHERE id BETWEEN rowOffset AND rowOffset+limitSize;

Before the loop, query MIN(id) and MAX(id) from the source table, start rowOffset at the minimum value, and loop up to the maximum value.
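
A minimal sketch of that loop, assuming it runs inside a stored procedure where rowOffset, maxId, and limitSize are already declared (an illustration, not a drop-in replacement):

SET rowOffset = (SELECT MIN(id) FROM Table1);
SET maxId     = (SELECT MAX(id) FROM Table1);

WHILE rowOffset <= maxId DO
  START TRANSACTION;
    INSERT IGNORE INTO Table2(id, field2, field3)
        SELECT f1, f2, f3
        FROM Table1
        WHERE id BETWEEN rowOffset AND rowOffset + limitSize;
  COMMIT;
  -- BETWEEN is inclusive on both ends, so advance past the last id in this chunk
  SET rowOffset = rowOffset + limitSize + 1;
END WHILE;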

This is the way pt-archiver works.

Upvotes: 7

Dave Stokes

Reputation: 823

Block is the operative word. Hopefully you are using InnoDB (which locks at the row level) and not MyISAM (which locks at the table level). Not knowing the complexity of the data or the hardware underneath, 10K records per loop may be too big.
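
A quick way to check which engine the tables use (the schema name is a placeholder):

SELECT TABLE_NAME, ENGINE
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = 'your_database'
   AND TABLE_NAME IN ('Table1', 'Table2');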

Upvotes: 0
